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Abstract 


Graphs  show  up  in  a  surprisingly  diverse  set  of  disciplines,  ranging  from  com¬ 
puter  networks  to  sociology,  biology,  ecology  and  many  more.  How  do  such  “nor¬ 
mal”  graphs  look  like?  How  can  we  spot  abnormal  subgraphs  within  them?  Which 
nodes/edges  are  “suspicious?”  How  does  a  virus  spread  over  a  graph?  Answering 
these  questions  is  vital  for  outlier  detection  (such  as  terrorist  cells,  money  launder¬ 
ing  rings),  forecasting,  simulations  (how  well  will  a  new  protocol  work  on  a  realistic 
computer  network?),  immunization  campaigns  and  many  other  applications. 

We  attempt  to  answer  these  questions  in  two  parts.  First,  we  answer  questions 
targeted  at  applications :  what  patterns/properties  of  a  graph  are  important  for  solving 
specific  problems?  Here,  we  investigate  the  propagation  behavior  of  a  computer  virus 
over  a  network,  and  find  a  simple  formula  for  the  epidemic  threshold  (beyond  which 
any  viral  outbreak  might  become  an  epidemic).  We  find  an  “information  survival 
threshold”  which  determines  whether,  in  a  sensor  or  P2P  network  with  failing  nodes 
and  links,  a  piece  of  information  will  survive  or  not.  We  also  develop  a  scalable, 
parameter-free  method  for  finding  groups  of  “similar”  nodes  in  a  graph,  corresponding 
to  homogeneous  regions  (or  CrossAssociations)  in  the  binary  adjacency  matrix  of  the 
graph.  This  can  help  navigate  the  structure  of  the  graph,  and  find  un-obvious  patterns. 

In  the  second  part  of  our  work,  we  investigate  recurring  patterns  in  real-world 
graphs,  to  gain  a  deeper  understanding  of  their  structure.  This  leads  to  the  development 
of  the  R-MAT  model  of  graph  generation  for  creating  synthetic  but  “realistic”  graphs, 
which  match  many  of  the  patterns  found  in  real-world  graphs,  including  power-law 
and  lognormal  degree  distributions,  small  diameter  and  “community”  effects. 
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Chapter  1 
Introduction 


Informally,  a  graph  is  a  set  of  nodes,  and  a  set  of  edges  connecting  some  node  pairs.  In  database 
terminology,  the  nodes  represent  individual  entities,  while  the  edges  represent  relationships  be¬ 
tween  these  entities.  This  formulation  is  very  general  and  intuitive,  which  accounts  for  the  wide 
variety  of  real-world  datasets  which  can  be  easily  expressed  as  graphs.  Some  examples  include: 

•  Computer  Networks:  The  Internet  topology  (at  both  the  Router  and  the  Autonomous  System 
(AS)  levels)  is  a  graph,  with  edges  connecting  pairs  of  routers/AS.  This  is  a  self-graph,  which 
can  be  both  weighted  or  unweighted. 

•  Ecology:  Food  webs  are  self-graphs  with  each  node  representing  a  species,  and  the  species 
at  one  endpoint  of  an  edge  eats  the  species  at  the  other  endpoint. 

•  Biology:  Protein  interaction  networks  link  two  proteins  if  both  are  necessary  for  some  bio¬ 
logical  process  to  occur. 

•  Sociology:  Individuals  are  the  nodes  in  a  social  network  representing  ties  (with  labels  such 
as  friendship,  business  relationship,  trust,  etc.)  between  people. 

•  User  Psychology:  Clickstream  graphs  are  bipartite  graphs  connecting  Internet  users  to  the 
websites  they  visit,  thus  encoding  some  information  about  the  psyche  of  the  web  user. 

•  Information  Retrieval:  We  can  have  bipartite  graphs  connecting  “document”  nodes  to  the 
“word”  nodes  that  are  present  in  that  document.  This  link  could  be  weighted  by  the  num¬ 
ber  of  occurrences  of  the  word  in  the  document.  We  also  have  co-authorship  self-graphs, 
connecting  authors  who  were  co-authors  of  some  document. 

In  fact,  any  information  relating  different  entities  (an  M  :  N  relationship  in  database  terminology) 
can  be  thought  of  as  a  graph.  This  accounts  for  the  abundance  of  graphs  in  so  many  diverse  topics 
of  interest,  most  of  them  large  and  sparse.  Figure  1.1  shows  some  such  graphs.  In  our  work,  we 
focus  on  unweighted,  unlabeled  graphs,  of  both  the  self-  and  bipartite-  graph  types. 

Given  the  widespread  presence  of  such  “graph”  datasets,  one  obvious  question  becomes  very 
important: 
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(a)  Food  web 


(d)  Friendship 
network 


(b)  Internet  map 


(c)  Protein  interactions 


(e)  Needle- sharing 
network 


(f)  Hijackers 
network 


Figure  1.1:  Examples  of  graphs:  (a)  Food  web  [Martinez,  1991]  (b)  Internet  topology  map 
(lumeta.com)  (c)  Protein  interaction  network  (genomebiology.com)  (d)  Friendship  net¬ 
work  among  students  in  one  American  school  [Moody,  2001]  (e)  “Needle-sharing”  network  of 
drug  users  [Weeks  et  ah,  2002]  (f)  Network  of  the  September-11  hijackers  [Krebs,  2001]. 
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How  can  we  efficiently  and  accurately  analyze  large  graph  datasets,  and 
mine  them  for  interesting  information? 


Thanks  to  the  generality  of  the  graph  representation  of  data,  any  new  graph  mining  algorithm  can 
be  applied  to  a  variety  of  datasets,  perhaps  obtained  from  completely  different  fields.  Mining  this 
information  can  provide  significant  benefits. 

•  Security:  The  “COPLINK  Connect”  system  [  hen  et  a  ,  ]  has  been  used  to  combine 

and  share  data  regarding  criminals  among  many  police  departments.  The  criminal  graph 
links  suspects,  crimes,  locations,  previous  case  histories,  etc.  These  linkages  provide  the 
necessary  information  for  effective  law  enforcement.  Graph  mining  algorithms  can  also  be 
used  for  finding  abnormal  subgraphs,  say  a  money-laundering  ring,  in  a  large  social  network 
of  financial  transactions. 

•  The  World-wide  Web:  To  provide  good  results,  a  search  engine  must  detect  and  counteract 
the  “outliers”  on  the  Web:  spam  sites,  “googlebombers”  [  Toogle  Boml  ]  and  the  like.  Graph 
mining  techniques  are  needed  to  automatically  find  such  webpages  out  of  the  billions  on  the 
WWW. 

•  Drug  discovery:  Graph  mining  techniques  can  be  used  to  search  for  frequent  subgraphs  in 

large  chemical  databases.  For  example,  Dehaspe  et  al.  [  >ehaspe  and  Toivonen,  1999]  look 
for  patterns  that  regularly  in  carcinogenic  compounds,  and  Milo  et  al.  [  lilo  et  a]  ,  ] 

detect  “motifs”  in  genetic  and  other  networks.  Any  information  gleaned  from  such  studies 
can  be  very  useful  in  understanding  diseases  and  developing  drugs  to  counter  them. 

•  Immunizations:  Starting  from  one  infected  individual,  a  contagious  disease  can  spread  across 
a  population  (or  computer  network,  for  a  computer  virus)  by  infecting  healthy  individuals 
who  are  linked  to  infected  ones.  Thus,  the  structure  of  the  “social  network”  plays  a  crit¬ 
ical  role  in  the  spread  of  the  disease,  and  in  choosing  the  nodes  whose  immunization  can 
“maximally  disrupt”  the  contagion  [  5astor-Satorras  and  Vespignani,  2001  ]. 

•  Viral  marketing:  This  is  exactly  the  opposite  of  immunizations:  the  aim  is  to  target  indi¬ 
viduals  who  are  best  positioned  in  the  social  network  to  generate  “word-of-mouth”  public¬ 
ity  [Richardson  and  Domingos,  2002], 

•  Biology:  Understanding  the  networks  of  regulatory  genes  and  interacting  proteins  can  give 

insights  into  the  biological  functions  of  these  genes  and  proteins  [  iarabas  ,  ]. 

Thus,  there  are  many  real-world  applications  where,  given  a  specific  graph,  we  want  to  find  some 
properties  or  answer  a  particular  query  about  it. 

As  against  this,  we  can  ask  questions  about  real-world  graphs  in  general,  and  not  about  any  one 
graph  in  particular.  We  could  ask: 


What  patterns  occur  regularly  in  real-world  graphs?  How  do  these  graphs 
evolve  over  time?  How  fast  do  they  grow? 
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The  answers  to  these  questions  would  be  the  “marks  of  realism”  in  real-world  graphs,  and  we  could 
use  them  to  detect  “fake”  graphs  or  “abnormal”  subgraphs.  Knowledge  of  these  patterns  is  also 
essential  for  another  problem: 


How  can  we  build  “realistic”  yet  efficient  graph  generators? 


The  criterion  forjudging  the  quality  of  a  graph  generator  is  the  “realism”  of  the  generated  graphs: 
they  are  realistic  exactly  if  they  match  the  common  graph  patterns.  Graph  generators  can  in  turn 
be  used  for  a  variety  of  purposes —  graph  sampling,  graph  compression,  and  graph  extrapolation 
for  simulation  studies,  to  name  a  few. 

Thus,  we  observe  a  dichotomy  in  graph  mining  applications:  we  can  answer  specific  queries 
on  a  particular  graph,  or  we  can  ask  questions  pertaining  to  real-world  graphs  in  general.  The 
separation  between  the  two  is,  however,  not  strict,  and  there  are  applications  requiring  tools  from 
both  sides.  For  example,  in  order  to  answer  “how  quickly  will  viruses  spread  on  the  Internet 
five  years  in  the  future,”  we  must  have  models  for  how  the  Internet  will  grow,  how  to  generate 
a  synthetic  yet  realistic  graph  of  that  size,  and  how  to  estimate  the  spread  of  viral  infections  on 
graphs. 

In  my  thesis,  I  explore  issues  from  both  sides  of  this  dichotomy.  Since  graphs  are  so  general, 
these  problems  have  been  studied  in  several  different  communities,  including  computer  science, 
physics,  mathematics,  physics  and  sociology.  Often,  this  has  led  to  independent  rediscovery  of  the 
same  concepts  in  different  communities.  In  my  work,  I  have  attempted  to  combine  these  viewpoints 
and  then  improve  upon  them. 

The  specific  problems  I  investigated  in  my  research  are  as  follows.  Chapter  2  contains  basic 
background  material.  Key  symbols  and  terms  used  throughout  the  document  are  defined  here.  Any 
extra  background  material  specific  to  each  topic  will  be  presented  in  the  corresponding  chapter. 
The  next  three  chapters  investigate  applications  of  graph  mining  on  specific  graphs.  In  chapter  3, 
we  analyze  the  problem  of  viral  propagation  in  networks:  “Will  a  viral  outbreak  on  a  computer 
network  spread  to  epidemic  proportions,  or  will  it  quickly  die  out?  ”  We  investigate  the  dependence 
of  viral  propagation  on  the  network  topology,  and  derive  a  simple  and  accurate  epidemic  threshold 
that  determines  if  a  viral  outbreak  will  die  out  quickly,  or  survive  for  long  in  the  network.  In 
chapter  4,  we  study  information  survival  in  sensor  networks:  Consider  a  piece  of  information  being 
spread  within  a  sensor  or  P2P  network  with  failing  links  and  nodes.  What  conditions  on  network 
properties  determine  if  the  information  will  survive  in  the  network  for  long,  or  die  out  quickly?  In 
chapter  5,  we  answer  the  question:  How  can  we  automatically  find  natural  node  groups  in  a  large 
graph?  Our  emphasis  here  is  on  a  completely  automatic  and  scalable  system:  the  user  only  needs 
to  feed  in  the  graph  dataset,  and  our  Cross-associations  algorithm  determines  both  the  number  of 
clusters  and  their  memberships.  In  addition,  we  present  automatic  methods  for  detecting  outlier 
edges  and  for  computing  “inter-cluster  distances.” 

Next,  in  chapter  6,  we  discuss  issues  regarding  real-world  graphs  in  general:  How  can  we 
quickly  generate  a  synthetic  yet  realistic  graph?  How  can  we  spot  fake  graphs  and  outliers?  We 
discuss  several  common  graph  patterns,  and  then  present  our  R-MAT  graph  generator,  which  can 
match  almost  all  these  patterns  using  a  very  simple  3-parameter  model.  Finally,  chapter  7  presents 
the  conclusions  of  this  thesis,  followed  by  a  discussion  of  future  work  in  chapter  8. 
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Chapter  2 

Basic  Background  Material 


Before  we  discuss  our  graph  mining  tools,  we  must  develop  some  basic  terminology.  We  will  start 
with  the  definition  of  a  graph,  and  of  the  different  types  of  graphs.  We  will  then  define  the  degree 
distribution  of  a  graph,  and  briefly  discuss  “power  laws.”  We  will  see  the  simple  random  graph 
model,  which  we  will  discuss  in  detail  later  in  Chapter  6.  Finally,  we  will  define  singular  values 
and  eigenvalues,  which  are  extremely  useful  in  the  analysis  of  matrices  and  graphs,  and  will  show 
up  in  later  chapters. 


2.1  Graph  Types 

The  informal  definition  of  a  graph  is  a  set  of  nodes  with  edges  connecting  some  of  them.  However, 
there  can  be  several  variations  on  this  theme,  each  of  which  has  a  special  name.  The  graph  can  be 
directed  (when  edges  point  from  one  node  to  another)  or  undirected  (edges  point  both  ways).  A 
graph  can  be  a  self-graph  or  a  bipartite  graph  depending  on  whether  the  set  of  nodes  pointed  to  by 
edges  is  the  same  as  the  set  of  nodes  pointed  from,  or  not.  In  addition,  edges  can  have  different 
weights,  or  not,  which  differentiates  weighted  graphs  from  unweighted  graphs.  All  of  these  terms 
are  formally  defined  below;  figure  2.1  shows  examples  of  these. 

Definition  1  (Graph  or  Self-graph).  A  graph  Q  =  (V,  £)  is  a  set  V  of  N  nodes,  and  a  set  £  of  E 
edges  between  them.  The  number  of  nodes  is  denoted  by  N  =  ||V||,  and  the  number  of  edges  by 
E=\\£\\. 

Definition  2  (Bipartite  graph).  In  a  bipartite  graph,  the  set  of  nodes  V  consists  of  two  disjoint 
sets  of  nodes  V\  and  V2:  V  =  Vi  U  V2.  Any  edge  connects  a  node  in  Vi  to  a  node  in  Vo,  that  is, 
(i,  j)  G  £  =>-  i  G  Vi  and  j  G  Vo.  The  number  of  nodes  in  the  two  node  set  are  N}  =  ||Vi||  and 
N2  =  1 1 V2 1 1 .  The  number  of  edges  is  still  E  =  ||£||. 

Definition  3  (Directed  and  undirected  graph).  In  a  directed  graph  Q  =  (V,  £),  each  edge  (1,  j )  G 
£  points  from  node  i  to  node  j.  An  undirected  graph  is  a  directed  graph  where  edges  point  both 
ways,  that  is,  (i,j)  G  £  =>-  (j,  i)  G  £. 

Definition  4  (Weighted  and  unweighted  graphs).  A  weighted  graph  Q  =  (  V.  £.  >V)  has  a  set 

of  nodes  V,  and  a  set  of  edges  £;  and  VV  represents  the  corresponding  weights  of  those  edges. 
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Symbol 

Description 

G 

The  graph.  Q  =  (V,  8)  for  a  weighted  graph, 

and  Q  =  (V,  8,  VV)  for  an  unweighted  graph 

V 

The  set  of  nodes  in  G.  V  =  Vi  U  V2  for  a  bipartite  graph 

£ 

The  set  of  edges 

VV 

Edge  weights:  zero  when  no  edge  exists,  and  positive  when  it  does. 

Edges  weights  are  1  for  an  unweighted  graph. 

A 

The  adjacency  matrix  of  the  graph 

N 

Number  of  nodes  in  G 

nun2 

Number  of  nodes  in  the  partitions  Vi  and  V2  of  a  bipartite  graph  G 

E 

Number  of  edges  in  G 

Table  2.1:  Table  of  basic  symbols. 


(a)  Directed 
self-graph 


(b)  Undirected  (c)  Bipartite  (d)  Weighted 
self-graph  graph  self-graph 


Figure  2.1:  Examples  of  different  graph  types:  (a)  Edges  in  a  directed  graph  point  one  way,  such  as 
a  graph  of  papers  where  edges  imply  references  from  one  paper  to  another,  (b)  Undirected  graphs 
have  edges  pointing  both  ways,  such  as  a  social  graph  of  kinship  between  individuals.  Both  of 
these  represent  self-graphs,  (c)  A  bipartite  graph  has  edges  connecting  two  different  sets  of  nodes, 
such  as  movies  and  the  actors  who  acted  in  them,  (d)  A  weighted  graph  has  weights  for  each  edge, 
such  as  the  bandwidth  of  a  link  connecting  two  routers. 


Weighted  graphs  can  be  both  self-graphs  or  bipartite  graphs.  Unweighted  graphs  are  a  special 
case  of  weighted  graphs,  with  cdl  weights  set  to  1. 


Definition  5  (Adjacency  matrix  of  a  self-graph).  The  adjacency  matrix  A  of  a  weighted  self¬ 
graph  Q  =  (V,  8.  VV)  is  an  N  x  N  matrix  such  that 


Wi,j  if{i,j)e£ 
0  otherwise 


for  i,j  G  1 . . .  N 


The  adjacency  for  an  unweighted  self- graph  merely  replaces  whJ  by  1. 

Definition  6  (Adjacency  matrix  of  a  bipartite- graph).  The  adjacency  matrix  A  of  a  weighted 
bipartite- graph  Q  =  (V,  8.  IV)  with  V  =  Vi  U  V2  is  an  N\  x  N2  matrix  such  that 


'-1,3 


Wi,j 

0 


iei...iVi 
j  el...  n2 


if  ihj)  e  £ 

otherwise 
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for 


The  adjacency  for  an  unweighted  bipartite-graph  merely  replaces  wt.j  by  1. 

In  addition,  graph  nodes  and  edges  can  have  attached  labels  (i.e.,  categorical  values);  such 
graphs  are  called  labeled  graphs.  However,  our  work  has  focused  on  unlabeled  and  unweighted 
graphs,  both  self-  and  bipartite,  and  both  directed  and  undirected.  Table  2.1  lists  these  symbols. 


2.2  Degree  Distributions  and  Power  Laws 

One  basic  property  of  nodes  in  the  graph  is  their  degree.  We  will  define  this,  and  the  corresponding 
degree  distribution  of  a  graph.  Many  degree  distributions  in  the  real-world  follow  power  laws , 
which  we  will  touch  upon  next. 

Definition  7  (Node  Degree).  The  outdegree  of  node  i  in  graph  Q  =  (V,  £)  is  the  number  of  nodes 
it  points  to: 

Out-degree  dout(i)  =  \\j  |  (i,j)  G  S\\ 

Similarly,  the  indegree  is  the  number  of  nodes  pointing  to  node  i: 

In-degree  din(i)  =  \\j  |  ( j,i )  G  £\\ 

For  an  undirected  graph,  dout(i )  =  dm(i)  for  all  nodes  i. 

Definition  8  (Degree  Distribution).  The  degree  distribution  of  an  undirected  graph  is  a  plot  of 
the  count  Ck  of  nodes  with  degree  k,  versus  the  degree  k,  typically  on  a  log-log  scale. 

Occasionally,  the  fraction  is  used  instead  of  cp,  however,  this  merely  translates  the  log-log 
plot  downwards.  For  directed  graphs,  outdegree  and  indegree  distributions  are  defined  similarly. 

Definition  9  (Power  Law  Degree  Distributions).  A  graph  Q  has  a  power-law  degree  distribution 
if  the  number  of  nodes  Ck  with  degree  k  is  related  to  k  by  the  equation: 

Ck  =  c  •  (2.1) 

where  c  and  7  are  positive  constants.  In  other  words,  the  degree  distribution  looks  linear  when 
plotted  on  the  log-log  scale.  The  constant  7  is  often  called  the  power  law  exponent. 


2.3  The  Random  Graph  Model 

The  random  graph  model  (  Erdos  and  Renyi,  1960]  is  a  famous  and  influential  model  of  graph 
generation.  We  will  discuss  many  other  models  in  Chapter  6;  however,  this  model  will  be  needed 
earlier,  and  so  we  briefly  mention  it  below. 

Starting  with  a  (user-provided)  number  of  nodes  N,  the  random  graph  generator  [Erdos  and 
Renyi,  1960]  adds  an  edge  between  every  pair  of  distinct  nodes  with  a  (user-provided)  probability 
p.  This  simple  model  leads  to  a  surprising  list  of  properties,  including  phase  transitions  in  the  size 
of  the  largest  component,  and  diameter  logarithmic  in  graph  size.  Its  ease  of  analysis  has  proven  to 
be  very  useful  in  the  early  development  of  the  field.  Hence,  many  early  graph-mining  algorithms 
were  focused  on  random  graphs,  and  a  basic  idea  of  random  graphs  is  useful  when  comparing  our 
algorithms  against  previous  methods. 
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2.4  Singular  Value  Decomposition 


The  technique  of  Singular  Value  Decomposition  (SVD)  is  extremely  useful  in  dealing  with  matri¬ 
ces.  We  will  just  present  the  basic  concept  here  [Press  et  ah,  1992]:  “Any  M  x  N  matrix  A  whose 
number  of  rows  M  is  greater  than  or  equal  to  its  number  of  columns  N,  can  be  written  as  the 
product  of  an  M  x  N  column-orthogonal  matrix  U,  an  N  x  N  diagonal  matrix  W  with  positive 
or  zero  elements  (the  singular  values),  and  the  transpose  of  an  A  x  N  orthogonal  matrix  V.” 


/  \ 


A 


V  7 


/  \ 


u 


V  7 


w  1 


w2 


\ 


WN 


(  \ 


V* 


V  7 


(2.2) 


This  decomposition  can  always  be  done;  it  provides  an  orthonormal  basis  for  the  rows  and  columns 
of  the  matrix  A;  it  is  useful  for  obtaining  least-squares  fits  to  some  matrix  equations,  for  approxi¬ 
mating  matrices  with  just  a  few  singular  values,  and  for  many  other  problems. 


2.5  Eigenvalues 

An  N  x  N  matrix  A  is  said  to  have  an  eigenvector  T  and  a  corresponding  eigenvalue  A  if  [Press 

et  al.,  1992] 

AT  =  AT  (2.3) 

These  again  provide  an  orthonormal  basis  for  the  matrix  (indeed,  eigenvalues  and  singular 
values  are  closely  related),  and  they  often  show  up  in  the  analysis  of  matrices,  as  we  shall  see  in 
later  chapters. 
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Chapter  3 

Epidemic  thresholds  in  viral  propagation 


“Will  a  viral  outbreak  on  a  computer  network  spread  to  epidemic 
proportions,  or  will  it  quickly  die  out?” 

The  relevance  of  this  question  to  the  design  of  computer  networks  is  obvious.  Starting  from  some 
small  comer  of  the  Internet,  modem  viruses  can  quickly  spread  across  the  network.  This  network 
can  be  a  physical,  such  as  the  network  of  routers,  or  it  could  be  an  “overlay”  network,  such  the  “ad- 
dressbook”  network  (for  viruses  which  spread  via  email).  However,  modem  anti-virus  techniques 
are  typically  “local”:  computers  are  disinfected  one  at  a  time,  without  considering  the  survival 
of  the  virus  in  the  network.  What  is  needed  is  method  of  combating  viruses  “globally,”  and  that 
requires  an  understanding  of  the  network-propagation  behavior  of  viral  outbreaks. 

In  addition  to  computer  networks,  the  same  questions  appear  in  several  other  fields  as  well. 
Rumors,  trends  and  the  latest  fads  spread  across  a  social  network ,  and  some  fashion  can  gain 
prominence  through  “word  of  mouth”  publicity.  Dependable  systems  must  be  designed  to  pre¬ 
vent  cascading  failures.  Immunization  programs  need  to  target  individuals  who  are  at  a  high  risk, 
possibly  due  to  their  position  in  the  social  network.  Similarly,  the  effectiveness  of  information 
dissemination  or  viral  marketing  campaigns  depends  on  reaching  “influential”  individuals  in  the 
network.  In  all  of  these  and  many  other  cases,  a  good  understanding  of  the  propagation  behavior 
over  arbitrary  network  topologies  is  important. 

Our  contributions,  as  detailed  in  [Wang  et  ah,  200:  ],  are  in  answering  the  following  questions: 

•  How  does  a  virus  spread?  Specifically,  we  want  an  analytical  model  of  viral  propagation, 
that  is  applicable  for  any  network  topology. 

•  When  does  the  virus  die  out,  and  when  does  it  become  endemic?  Conceptually,  a  tightly 
connected  graph  offers  more  possibilities  for  the  virus  to  spread  and  survive  in  the  network 
than  a  sparser  graph.  Thus,  the  same  virus  might  be  expected  to  die  out  in  one  graph  and 
become  an  epidemic  in  another.  What  features  of  the  graph  control  this  behavior?  We  find  a 
simple  closed-form  expression  for  the  “epidemic  threshold”  below  which  the  virus  dies  out, 
but  above  which  it  can  become  an  epidemic. 

•  Below  the  threshold,  how  quickly  will  the  virus  die  out?  A  logarithmic  decay  of  the  virus 
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might  still  be  too  slow  to  have  practical  impact. 

We  will  first  present  some  background  material  relevant  to  this  topic  in  Section  3.1.  We  then 
present  our  mathematical  model  of  viral  propagation  in  Section  3.2,  and  our  derivation  of  the 
epidemic  threshold  condition  in  Section  3.3.  Finally,  we  experimentally  demonstrate  the  accuracy 
of  our  model  in  Section  3.4.  Details  of  proofs  are  presented  in  Section  3.5,  followed  by  a  summary 
in  Section  3.6. 


3.1  Related  Work 

The  process  of  viral  propagation  has  been  studied  is  several  communities,  including  epidemiolo¬ 
gists,  computer  scientists,  physicists  and  probability  theorists.  The  broad  focus  of  all  this  work 
has  been  on  two  orthogonal  aspects:  the  behavior  of  a  single  node  in  the  graph,  and  the  effect  of 
network  topology. 

3.1.1  Behavior  of  a  single  node 

How  do  individual  nodes  behave  during  a  viral  infection?  The  epidemiological  community  has 
studied  this  problem  for  a  long  time,  and  several  models  have  been  proposed  [  Bailey,  1975],  We 
will  look  at  the  most  important  ones  below: 

•  Susceptible-Infective-Susceptible  (SIS)  Model:  In  this  model,  each  node  can  be  in  one  of 
two  states:  healthy  but  susceptible  (S)  to  infection,  or  infected  (I).  An  infected  node  can  be 
cured  locally  (say,  by  anti-virus  software),  and  it  immediately  becomes  susceptible  again. 
An  infected  node  can  also  spread  the  virus  along  network  links  to  neighboring  susceptible 
nodes. 

•  Susceptible-Infective-Removed  (SIR)  Model:  Here,  once  an  infected  node  is  cured,  it  is  im¬ 
mune  to  further  infections  and  is  removed  (R)  from  the  network. 

•  Extensions:  Several  extensions  to  these  basic  models  have  been  proposed.  Two  common 
ideas  are  “infection  delay”  and  “user  vigilance”  [Wang  and  Wang,  2003].  Infection  delay 
represents  a  delay  in  spreading  the  virus  from  an  infected  node.  User  vigilance  is  some 
time  period  after  a  disinfection  when  the  user  is  vigilant  against  new  infections,  reducing  the 
susceptibility  of  that  node. 

While  each  model  is  suitable  for  different  applications,  we  focus  on  the  plain  SIS  model,  and 
all  our  future  discussions  will  be  based  on  it.  The  virus  is  modeled  by  two  parameters:  the  virus 
“birth  rate”  (3,  with  which  it  tries  to  spread  across  a  network  link  to  a  susceptible  node;  and  the 
virus  “death  rate”  5  with  which  each  infected  node  disinfects  itself.  If  death  rate  is  very  low,  the 
infection  should  survive  for  a  long  time,  but  if  the  birth  rate  is  very  low,  it  should  die  off  quickly. 
Is  there  some  condition/threshold  that  determines  the  fate  of  a  viral  outbreak? 

This  is  called  the  “epidemic  threshold”  of  the  network,  and  has  been  the  focus  of  a  lot  of 
research  activity.  Informally,  the  epidemic  threshold  is  a  value  r  such  that 
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(3.1) 


If  l<r.  the  viral  outbreak  dies  out  quickly,  but 

iff  >r,  the  virus  may  become  endemic. 

We  shall  provide  a  formal  definition  later  in  the  text. 


3.1.2  Interactions  between  nodes 


While  the  previous  paragraphs  described  the  spread  of  a  virus  in  the  neighborhood  of  an  infected 
node,  the  epidemic  threshold  is  a  global  value  depending  on  the  topology  of  the  entire  network. 
This  question  has  been  approached  from  several  viewpoints,  of  which  the  major  ones  we  discuss 
below. 


The  KW  model:  Kephart  and  White  [  ,  ,  ]  modeled  the  network  as 

directed  graph,  but  with  a  constrained  topology  (either  an  Erdos-Renyi  random  graph,  or  a  tree,  or 
a  2D  lattice).  For  this  model  (which  we  will  call  KW),  they  derived  steady-state  solutions  for  the 
number  of  infected  nodes  tjkw\  for  example,  for  the  random  graph,  it  is: 

mw  =  N(1~  ^y)  (3-2) 


where  N  is  the  total  number  of  nodes,  and  (k)  the  average  degree.  They  also  derived  the  epidemic 
threshold: 


tkw  — 


(k) 


(3.3) 


While  this  is  a  good  model  for  networks  where  contact  among  nodes  is  sufficiently  homoge¬ 
neous,  there  is  overwhelming  evidence  that  real  networks  (including  social  networks  [Domingos 
and  Richardson,  200  ],  router  and  AS  networks  [  aloutsos  et  al.,  1999],  and  Gnutella  overlay 
graphs  [Ripeanu  et  al.,  2002])  deviate  from  such  homogeneity — they  follow  a  power  law  structure 
instead.  In  such  graphs,  there  exist  a  few  nodes  with  very  high  connectivity,  but  the  majority  of 
the  nodes  have  low  connectivity.  The  high-connectivity  nodes  are  expected  to  often  get  infected 
and  then  propagate  the  virus,  making  the  infection  harder  to  eradicate.  We  shall  show  that  the  KW 
model  is  a  special  case  of  our  model,  and  we  match  its  predictions  for  homogeneous  graphs. 


The  MFA  model  (Mean  Field  Assumption  model):  Pastor-Satorras  and  Vespignani  [Pastor-Satorras 
and  Vespignani,  2001,  Pastor-Satorras  et  al.,  2001,  Pastor-Satorras  and  Vespignani,  2002a, b]  al¬ 
lowed  the  network  topology  to  be  a  power-law.  To  derive  analytic  results,  they  used  the  mean-field 
assumption  (MFA),  where  all  nodes  with  the  same  degree  are  treated  equally.  We  call  this  the 
“MFA”  model.  For  the  special  case  of  the  Barabasi- Albert  power-law  topology  [Barabasi  and 
Albert,  1 9  ],  their  steady-state  infected  population  size  is: 

t]mfa  =  2  Ne~5^ 


where  m  is  the  minimum  degree  in  the  network.  However,  this  has  not  been  extended  to  all  power- 
law  graphs  in  general.  They  also  derive  an  epidemic  threshold: 


(k) 

MFA  (k2) 


(3.4) 
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where  (k)  is  the  average  degree,  and  (k2)  the  average  of  the  squared  degree. 

While  the  mean-field  assumption  lets  us  compute  a  general  epidemic  threshold,  there  is  no 
reason  for  any  two  nodes  with  the  same  degree  to  be  equally  affected  by  the  virus.  One  node  could 
be  in  the  core  of  a  network,  and  another  in  the  periphery;  the  one  in  the  core  will  presumably 
be  infected  more  often.  Indeed,  we  observe  experimentally  that  our  model  yields  more  accurate 
predictions  than  the  MFA  model. 

Correlated  networks:  Boguna  and  Satorras  [  ,  ]  studied  networks 

where  the  connectivity  of  a  node  is  related  to  the  connectivity  of  its  neighbors.  These  correlated 
networks  include  Markovian  networks  where,  in  addition  to  the  degree  distribution  P(k),  a  func¬ 
tion  P(k\k')  determines  the  probability  that  a  node  of  degree  k  is  connected  to  a  node  of  degree 

//. 

While  some  degree  of  correlations  may  exist  in  real  networks,  it  is  often  difficult  to  character¬ 
ize  connectivity  correlations  with  a  simple  P{k\k')  function.  Computing  these  values  in  real-world 
graphs  with  strong  confidence  guarantees  is  also  hard.  In  addition,  our  results  indicate  that  knowl¬ 
edge  of  P{k\k')  is  not  needed  to  compute  the  epidemic  threshold  for  arbitrary  graphs. 

Interacting  particle  systems:  This  is  a  branch  of  probability  theory  where  “particles”  are  allowed 
to  propagate  over  a  simple  network  according  to  different  processes;  the  one  that  is  closest  to  our 
work  is  the  “contact  process”  [  larris,  1974,  Liggett,  1 985].  This  area  has  seen  some  excellent  and 
rigorous  theoretical  work.  However,  the  networks  that  are  studied  are  (a)  infinite,  which  changes 
the  questions  being  asked,  and  (b)  simple,  typically  only  line  graphs  and  grids  [Durrett  and  Li  , 
1988,  Durrett  and  Schonmann,  1988,  Durrett  et  al.,  1989].  Instead,  the  (possibly  arbitrary)  network 
topology  is  a  central  part  of  our  work.  Asavathiratham  [Asavathiratham,  ]  studies  similar 
processes,  but  does  not  investigate  epidemic  thresholds. 

Thus,  all  current  methods  either  constrain  the  network  topology  or  use  strong  assumptions. 
Next,  we  propose  a  general  method  of  modeling  viral  propagation  on  any  finite  graph ,  without 
having  to  resort  to  the  mean-field  approximation. 


3.2  Model  of  Viral  Propagation 

We  develop  a  mathematical  model  for  the  SIS  method  of  viral  infection,  which  is  applicable  to  any 
undirected  graph  G.  Our  model  assumes  very  small  discrete  timesteps  of  size  At,  where  At  — *•  0. 
Our  results  apply  equally  well  to  continuous  systems,  though  we  focus  on  discrete  systems  for 
ease  of  exposition.  Within  a  At  time  interval,  an  infected  node  i  tries  to  infect  its  neighbors  with 
probability  /3.  At  the  same  time,  i  may  be  cured  with  probability  5.  Table  3.1  defines  these  and 
other  symbols. 


3.2.1  Viral  propagation  as  a  Markov  chain 

The  spread  of  the  virus  is  governed  by  a  Markov  chain  with  2N  states.  Each  state  in  the  Markov 
chain  corresponds  to  one  particular  system  configuration  of  N  nodes,  each  of  which  can  be  in 
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one  of  two  states  (Susceptible  or  Infective),  which  leads  to  2N  possible  configurations.  Also,  the 
configuration  at  time-step  t  +  1  depends  only  on  that  at  time-step  t:  thus,  it  is  a  Markov  chain. 

This  Markov  chain  also  has  an  “absorbing”  state,  when  all  nodes  are  uninfected  (i.e.,  Suscep¬ 
tible).  This  absorbing  state  can  be  reached  from  any  starting  state  of  the  Markov  chain,  implying 
that  this  state  will  be  reached  with  probability  1  over  a  long  period  of  time.  However,  this  state 
could  be  reached  very  quickly,  or  it  could  take  time  equivalent  to  the  age  of  the  universe  (in  which 
case,  the  viral  epidemic  practically  never  dies). 

3.2.2  Main  Idea 

The  obvious  approach  of  solving  the  Markov  chain  becomes  infeasible  for  large  N,  due  to  the  ex¬ 
ponential  growth  in  the  size  of  the  chain.  To  get  around  this  limitation,  we  use  the  “independence” 
assumption,  that  is,  the  states  of  the  neighbors  of  any  given  node  are  independent.  Thus,  we  replace 
the  problem  with  Equation  3.6  (our  “non-linear  dynamical  system”  discussed  below),  with  only  N 
variables  instead  of  2N  for  the  full  Markov  chain.  This  makes  the  problem  tractable,  and  we  can 
find  closed-form  solutions. 

Note  that  the  independence  assumption  places  no  constraints  on  network  topology;  our  method 
works  with  any  arbitrary  finite  graph.  Also,  this  assumption  is  far  less  constraining  than  the  mean- 
field  assumption.  In  fact,  as  we  will  show  experimentally,  the  number  of  infected  nodes  over  time 
under  the  “independence  assumption”  is  very  close  to  the  that  without  any  assumptions,  for  a  wide 
range  of  datasets. 

3.2.3  Mathematical  formulation 

Let  the  probability  that  a  node  i  is  infected  at  time  t  by  pL{t).  Let  Q(t)  be  the  probability  that  a 
node  i  will  not  receive  infections  from  its  neighbors  in  the  next  time-step.  This  happens  if  each 
neighbor  is  either  uninfected,  or  is  infected  but  fails  to  spread  the  virus  with  probability  (1  —  /3): 

Ci(t)  =  IJ  (Pj(f-1)(1-/3)  +  (1-pj(f-1))) 

j -.neighbor  of  i 

n  (3.5) 

j -.neighbor  of  i 

This  is  the  independence  assumption:  we  assume  that  probabilities  pj(t  —  1)  are  independent  of 
each  other. 

A  node  i  is  healthy  at  time  t  if  it  did  not  receive  infections  from  its  neighbors  at  t  and  i  was 
uninfected  at  time- step  t  —  1,  or  was  infected  but  was  cured  at  t.  Denoting  the  probability  of  a  node 
i  being  infected  at  time  t  by  Pi(t): 

1  ~Pi(t)  =  (1  -Piit  -  1))  •  Ci(t)  +  5  -pi(t  -  1)  •  (ft)  i  —  1 . . .  N  (3.6) 

This  equation  represents  our  NLDS  (Non-Linear  Dynamical  System).  Ligure  3.1  shows  the  tran¬ 
sition  diagram. 
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Resisted  infection 


Cured 


Figure  3.1:  The  SIS  model,  as  seen  from  a  single  node:  Each  node,  in  each  time  step  t,  is  either 
Susceptible  (S)  or  Infective  (I).  A  susceptible  node  i  is  currently  healthy,  but  can  be  infected  (with 
probability  1  —  Qtt)  on  receiving  the  virus  from  a  neighbor.  An  infective  node  can  be  cured  with 
probability  <5;  it  then  goes  back  to  being  susceptible.  Note  that  Q:t  depends  on  the  both  the  virus 
birth  rate  j3  and  the  network  topology  around  node  i. 


Symbol 

Description 

G 

An  undirected  connected  graph 

N 

Number  of  nodes  in  Q 

E 

Number  of  edges  in  Q 

A 

Adjacency  matrix  of  Q 

A' 

The  transpose  of  matrix  A 

P 

Virus  birth  rate  on  a  link  connected  to  an  in¬ 
fected  node 

5 

Virus  death  rate  on  an  infected  node 

t 

Time  stamp 

Pi(t) 

Probability  that  node  i  is  infected  at  t 

Pit) 

Pit )  =  ipiit),P2it),...,pNit))' 

m 

Probability  that  node  i  does  not  receive  infec¬ 
tions  from  its  neighbors  at  t 

The  i-th  largest  eigenvalue  of  A 

'U'ifA 

Eigenvector  of  A  corresponding  to  \a 

'LL i,A 

Transpose  of  u^a 

s 

The  ‘system’  matrix  describing  the  equations  of 
infection 

\,s 

The  i-th  largest  eigenvalue  of  S 

Vt 

Number  of  infected  nodes  at  time  t 

(k) 

Average  degree  of  nodes  in  a  network 

(k2) 

Connectivity  divergence  (average  of  squared  de¬ 
grees) 

Table  3.1:  Table  of  Symbols 
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We  have  thus  turned  the  Markov  chain  into  a  non-linear  dynamical  system  (Equation  3.6),  and 
we  will  focus  on  its  behavior.  Specifically,  we  will  find  the  epidemic  threshold  for  this  system.  An 
informal  definition  of  the  epidemic  threshold  was  presented  in  Equation  3.1;  we  will  now  present 
a  formal  definition  for  the  NLDS  case. 

Definition  10  (Epidemic  threshold  under  NLDS).  The  epidemic  threshold  t^i.ds  for  NLDS  is 

a  value  such  that 

(3/5  <  t nlds  =>■  infection  dies  out  over  time,  pft )  — >  0  as  t  — >  oo  Vi 
/3/5>tnlds  =>•  infection  survives  and  becomes  an  epidemic 


3.3  The  Epidemic  Threshold 


In  this  section,  we  shall  discuss  our  simple  closed-form  formula  for  rNLDS.  Surprisingly,  it  depends 
only  on  one  number:  the  largest  eigenvalue  of  the  graph.  Then,  we  will  prove  that  below  the 
threshold  (i.e.,  when  the  viral  outbreak  dies),  the  decay  is  exponential.  Finally,  we  will  present 
some  corollaries  and  special  cases,  which  demonstrate  the  intuitiveness  of  the  result. 


Theorem  1  (Epidemic  Threshold).  In  NLDS,  the  epidemic  threshold  TN1yDs  for  an  undirected 
graph  is 


TnLDS  - 


(3.7) 


where  A  i  4  is  the  largest  eigenvalue  of  the  adjacency  matrix  A  of  the  network. 


Proof  We  will  prove  this  in  two  parts:  the  necessity  of  this  condition  in  eliminating  an  infection, 
and  the  sufficiency  of  this  condition  for  wiping  out  any  initial  infection.  The  corresponding  the¬ 
orem  statements  are  shown  below;  the  proofs  are  described  in  Section  3.5.  Following  this,  we 
will  see  how  quickly  an  infection  dies  out  if  the  epidemic  threshold  condition  is  satisfied.  Recent 
follow-up  work  on  this  problem  by  Ganesh  et  al.  [Ganesh  et  ah,  2005]  has  confirmed  our  results, 
and  added  more  results  about  system  behavior  above  the  epidemic  threshold.  □ 

Theorem  2  (Part  A:  Necessity  of  Epidemic  Threshold).  In  order  to  ensure  that  over  time,  the 
infection  probability  of  each  node  in  the  graph  goes  to  zero  ( that  is,  the  epidemic  dies  out),  we 
must  have  §  <  t^lds  —  • 

0  M  ,A 

Proof  Proved  in  Section  3.5.  □ 

Theorem  3  (Part  B:  Sufficiency  of  Epidemic  Threshold).  If  f  <  tnlds  =  ,  then  the 

epidemic  will  die  out  over  time  (the  infection  probabilities  will  go  to  zero),  irrespective  of  the  size 
of  the  initial  outbreak  of  infection. 


□ 


M- 


Proof.  Proved  in  Section  3.5. 

Definition  11  (Score).  Score  s  =  f  ■  A 
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Theorem  1  provides  the  conditions  under  which  an  infection  dies  out  (s  <  1)  or  survives  (s  >  1) 
in  our  dynamical  system.  To  visualize  this,  consider  the  spread  of  infection  as  a  random  walk  on 
the  graph.  The  virus  spreads  across  one  hop  according  to  j3A,  and  thus  it  spreads  across  h  hops 
according  to  (/ 3A)h ,  which  grows  as  (/3Ai  ^)  every  hop.  On  the  other  hand,  the  virus  dies  off  at  a 
rate  5.  Thus,  the  “effective”  rate  of  spread  is  approximately  (3\^A/8,  which  is  exactly  the  “score” 
s.  Thus,  to  have  any  possibility  of  an  epidemic,  the  score  s  must  be  greater  than  1.  This  is  exactly 
the  epidemic  threshold  condition  that  we  find. 

We  can  ask  another  question:  if  the  system  is  below  the  epidemic  threshold,  how  quickly  will  an 
infection  die  out? 


Theorem  4  (Exponential  Decay).  When  an  epidemic  is  diminishing  ( therefore  d/5  < 
s  <  1),  the  probability  of  infection  decays  at  least  exponentially  over  time. 


i 

Ai,a 


and 


Proof  Proved  in  Section  3.5. 


□ 


We  can  use  Theorem  1  to  compute  epidemic  thresholds  for  many  special  cases,  as  detailed  below. 
All  of  these  are  proved  in  Section  3.5. 

Corollary  1.  NLDS  subsumes  the  KW  model  for  homogeneous  or  random  Erdos-Renyi  graphs. 

Corollary  2.  The  epidemic  threshold  t^lds  f°r  a  star  topology,  is  exactly  where  \fd  is  the 
square  root  of  the  degree  of  the  central  node. 

Corollary  3.  The  epidemic  threshold  for  an  infinite  power-law  network  is  zero. 

Corollary  4.  Below  the  epidemic  threshold  (score  s  <  1),  the  expected  number  of  infected  nodes 
rjt  at  time  t  decays  exponentially  over  time. 


“Optimistic”  versus  “Pessimistic”  scenarios:  The  SIS  model  is  a  “pessimistic”  model  of  viral 
infection:  no  node  is  ever  immunized,  and  a  cured  node  is  immediately  susceptible,  making  it 
very  hard  for  us  to  stop  the  viral  spread.  As  against  this,  we  can  think  of  the  SIR  model  as  the 
“optimistic”  model.  Thus,  if  we  design  a  network  topology  to  be  resistant  to  viral  epidemics  at 
some  epidemic  threshold  (given  by  Theorem  1)  for  the  SIS  model,  it  should  be  resistant  even 
under  the  SIR  model. 

Again,  irrespective  of  the  model,  we  can  have  different  starting  conditions  for  the  viral  infec¬ 
tion:  one  node  infected  at  time  t  =  0,  all  nodes  infected,  or  somewhere  in  between.  Theorem  1 
holds  even  in  the  pessimistic  case  where  all  nodes  start  off  infected:  if  we  are  below  the  threshold 
(score  s  <  1),  then  the  epidemic  dies  out  quickly  regardless  of  the  starting  state. 

Thus,  since  we  work  under  a  pessimistic  scenario,  our  results  will  hold  for  many  other  models 
and  conditions.  The  price  for  this  generality  is  that  our  results  may  be  too  conservative;  perhaps 
less  stringent  conditions  would  suffice  to  drive  the  virus  extinct  in  optimistic  scenarios.  There 
may  also  be  other  applications  where  the  optimistic  scenario  makes  more  sense.  These  are  all 
interesting  questions,  and  we  intend  to  study  them  in  the  future. 
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Dataset 

Nodes 

Edges 

Largest 

Eigenvalue 

RANDOM 

256 

982 

8.691 

POWER-LAW 

3,000 

5,980 

11.543 

STAR-10K 

10,001 

10,000 

100 

OREGON 

11,461 

32, 730 

75.241 

Table  3.2:  Dataset  characteristics. 


3.4  Experiments 

We  want  to  answer  the  following  questions: 

(QD  How  closely  does  our  dynamical  system  model  the  spread  of  a  viral  infection? 

(Q2)  How  accurate  is  our  epidemic  threshold  condition? 

(Q3)  How  does  our  epidemic  threshold  compare  to  those  suggested  in  [  'astor-Satorras  and  Vespig- 
nani,  2002a]? 

(Q4)  Below  the  threshold,  does  the  epidemic  die  out  exponentially  quickly? 

The  datasets  we  used  were: 

•  RANDOM :  An  Erdos-Renyi  random  graph  of  256  nodes  and  982  edges. 

•  POWER-LAW :  A  graph  of  3,  000  nodes  and  5,  980  edges,  generated  by  the  popular  Barabasi- 
Albert  process  [Barabasi  and  Albert,  199  ].  This  generates  a  graph  with  a  power- law  degree 
distribution  of  exponent  3. 

•  STAR-10K :  A  “star”  graph  with  one  central  hub  connected  to  10,  000  “satellites.” 

•  OREGON'.  A  real-world  graph  of  network  connections  between  Autonomous  Systems  (AS), 
obtained  from  http  :  /  /topology  .  eecs  .  umich  .  edu/ data  .  html.  It  has  11, 461 
nodes  and  32,  730  edges. 

For  each  dataset,  all  nodes  were  initially  infected  with  the  virus,  and  then  its  propagation  was 

studied  in  a  simulator.  All  simulations  were  run  for  10,  000  timesteps,  and  were  repeated  100  times 

with  different  seeds.  Table  3.2  provides  more  details. 


3.4.1  (Ql)  Accuracy  of  dynamical  system 

In  figure  3.2,  we  plot  the  number  of  infected  nodes  over  time,  and  compare  it  against  the  evolution 
of  NLDS  (Equation  3.6).  Several  values  of  the  score  s  were  used  (below,  at  and  above  the  thresh¬ 
old).  In  all  cases,  the  simulation  results  are  extremely  close  to  those  from  NLDS.  Thus,  NLDS  is 
a  good  approximation  of  the  true  system. 
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Figure  3.2:  Simulation  versus  dynamical  system:  The  number  of  infected  nodes  is  plotted  versus 
time,  for  both  simulations  (lines)  and  NLDS  (points).  Observe  the  close  fit  between  the  two. 


Thus,  the  dynamical  system  of  equation  3.6  is  a  very  good  representation  of  viral  propagation 
in  arbitrary  networks. 


3.4.2  (Q2)  Accuracy  of  epidemic  threshold 

Figure  3.3  shows  the  number  of  infected  nodes  over  time  for  various  values  of  the  score  s,  in  log- 
log  scales.  We  observe  a  clear  trend:  below  the  threshold  (s  <  1),  the  infection  dies  off,  while  it 
survives  above  the  threshold  (s  >  1).  This  is  exactly  as  predicted  by  Theorem  1,  and  justifies  our 
formula  for  the  threshold. 

Thus,  our  epidemic  threshold  condition  is  accurate:  infections  become  extinct  below  the  thresh¬ 
old,  and  survive  above  it. 

3.4.3  (Q3)  Comparison  with  previous  work 

In  figure  3.4,  we  compare  the  predicted  threshold  of  our  model  against  that  of  the  MFA  model. 
For  several  values  of  the  score  s,  we  plot  the  number  of  infected  nodes  left  after  a  “long”  time 
(specifically  500, 1000  and  2000  timesteps).  Below  the  threshold,  the  infection  should  have  died 
out,  while  it  could  have  survived  above  the  threshold.  In  all  cases,  we  observe  that  this  change  in 
behavior  (extinction  to  epidemic)  occurs  at  our  predicted  epidemic  threshold. 

Note  that  the  NLDS  threshold  is  more  accurate  than  that  of  the  MFA  model  for  the  STAR- 
10K  and  the  real-world  OREGON  graphs,  while  we  subsume  their  predictions  for  RANDOM  and 
POWER-LAW ,  which  are  the  topologies  the  MFA  model  was  primarily  developed  for. 

Recalling  also  that  our  model  subsumes  the  KW  model  on  homogeneous  networks  (Corol¬ 
lary  1),  we  arrive  at  the  following  conclusion:  our  epidemic  threshold  for  NLDS  subsumes  or 
performs  better  than  those  for  other  models. 


3.4.4  (Q4)  Exponential  decay  of  infection  under  threshold 

Figure  3.5  demonstrates  the  rate  of  decay  of  the  infection  when  the  system  is  below  the  threshold 
(s  <  1).  The  number  of  infected  nodes  is  plotted  against  time  on  a  log-linear  scale  We  see  that 
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(a)  RANDOM 


(b)  POWER-LAW 


(d)  OREGON 


Figure  3.3:  Accuracy  of  our  epidemic  threshold:  The  number  of  infected  nodes  is  plotted  versus 
time  for  various  values  of  the  score  s  (log-log  scales).  The  case  when  score  s  =  1  is  shown  with 
the  dotted  line.  There  is  a  clear  distinction  between  the  cases  where  s  <  1  and  s  >  1:  below  1, 
the  infection  dies  out  quickly,  while  above  1,  it  survives  in  the  graph.  This  is  exactly  our  proposed 
epidemic  threshold  condition. 


(a)  RANDOM 


(b)  POWER-LAW 


Figure  3.4:  Comparison  with  the  MFA  model:  We  plot  number  of  infected  nodes  after  a  “long” 
time  for  various  values  of  the  score  s,  versus  s.  For  each  dataset,  we  show  results  after  500, 1000 
and  2000  timesteps.  In  each  case,  we  observe  a  sharp  jump  in  the  size  of  the  infected  population  at 
our  epidemic  threshold  of  s  —  1 .  Note  that  our  results  are  far  more  accurate  than  those  of  the  MFA 
model. 
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(a)  RANDOM 


(b)  POWER-LAW 


Figure  3.5:  Exponential  decay  below  the  threshold:  The  number  of  infected  nodes  is  plotted  versus 
time  for  various  values  of  the  score  s  <  1  (log-linear  scales).  Clearly,  the  decay  is  exponentially 
quick  (because  it  is  linear  on  a  log-linear  plot).  This  matches  the  predictions  of  Theorem  4. 

in  all  cases,  the  decay  is  exponential  (because  the  plot  looks  linear  on  a  log-linear  scale).  This  is 
exactly  as  predicted  by  Theorem  4. 

Thus,  the  infection  dies  out  exponentially  quickly  below  the  threshold  (s  <  1). 


3.5  Details  of  Proofs 


Theorem  1  (Epidemic  Threshold).  In  NLDS,  the  epidemic  threshold  t^lds  for  an  undirected 
graph  is 


Tnlds  ~  xlfo 


(3.8) 


where  X\a  is  the  largest  eigenvalue  of  the  adjacency  matrix  A  of  the  network. 


Proof  The  proof  follows  from  the  proofs  of  Theorems  2  and  3  below.  □ 

Theorem  2  (Part  A:  Necessity  of  Epidemic  Threshold).  In  order  to  ensure  that  over  time,  the 

infection  probability  of  each  node  in  the  graph  goes  to  zero  ( that  is,  the  epidemic  dies  out),  we 

must  have  §  <  t^lds  — 

0  M,a 

Proof  We  have  modeled  viral  propagation  as  a  discrete  dynamical  system,  with  the  following 
non-linear  dynamical  equation: 

1  -Pi(t)  =  (1  -Pi(t-  1))  ■  (i(t)  +5  -Pi(t-  1)  ■  Q(t)  (from  Eq  3.6) 
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or,  p{t) 

where,  gt  ( p(t  -  1)) 


(3.9) 


=  9  {Pit-  1)) 

=  1  -  (1  -  Pi{t  ~  1))  •  (i(t)  -  S  ■  pi{t  -  1)  •  (fit) 


where  g  :  [0, 1];V  — >  [0, 1  ]  v  is  a  function  on  the  probability  vector,  and  gfi.)  is  its  z-th  element. 

The  infection  dies  out  when  pt  =  0  for  all  i.  We  can  easily  check  that  the  p  =  0  vector  is  a.  fixed 
point  of  the  system;  when  pfit  —  1)  =  0  for  all  i  (all  nodes  healthy),  the  equation  above  results  in 
Pi(t)  =  0  for  all  z,  and  so  all  nodes  remain  healthy  forever. 

The  question  we  need  to  answer  is:  If  the  infection  probabilities  of  all  nodes  in  the  graph 
come  close  to  zero,  will  the  dynamical  system  push  them  even  closer  to  zero?  If  yes,  the  infection 
probabilities  will  go  to  zero  and  the  infection  will  die  out,  but  if  not,  the  infection  could  survive  and 
become  an  epidemic.  This  concept  is  known  as  asymptotic  stability,  and  conditions  for  achieving 
asymptotic  stability  are  well-known. 


Definition  12  (Asymptotic  Stability  of  a  Fixed  Point).  A  fixed  point  Pf  is  “asymptotically  stable” 
if,  on  a  slight  perturbation  from  Pf,  the  system  returns  to  Pf  (as  against  moving  away,  or  staying 
in  the  neighborhood  of  Pf  but  not  approaching  it)  [  lirsch  and  Smale,  1974], 


Lemma  1  (Condition  for  Asymptotic  stability).  The  system  is  asymptotically  stable  al  p  =  0 


if  the  eigenvalues  o/Vg(0)  are  less  than  1  in  absolute  value.  Recall  that  Vg{cc)  = 


dgt 

dxi 


for 


i,j  =  1 ...  N,  and  thus, 


Vfi<0) 


_  dgj 

dPj 


p= O' 


Proved  in  [Hirsch  and  Smale,  1974], 


From  Eq  3.9,  we  can  calculate  Wg( 0): 


Vg(0) 


Thus,  Vg(0) 


f  f3Ajf  for  j  f  i 

[1  —  5  for  j  =  i 

(3A!  +  (1  -  5)1 
(3A+  (1-5)1 


(3.10) 


where  the  last  step  follows  because  A  =  A'  (since  the  graph  is  undirected). 

This  matrix  describes  the  behavior  of  the  virus  when  it  is  very  close  to  dying  out;  we  call  it  the 
system  matrix  S: 

s  =  Vs(0)  =  (3A  +  (1  -  5)1  (3.11) 

As  shown  in  Lemma  2  following  this  proof,  the  matrices  A  and  S  have  the  same  eigenvectors  uits, 
and  their  eigenvalues,  Xl:.\  and  \itS,  are  closely  related: 


Xits  —  1  —  5  +  (3X  itA  Vz 


(3.12) 


Hence,  using  the  stability  condition  above,  the  system  is  asymptotically  stable  when 

|Ai,s|  <  1  Vz  (3.13) 

that  is,  all  eigenvalues  of  S  have  absolute  value  less  than  one. 

Now,  since  A  is  a  real  symmetric  matrix  (because  the  graph  is  undirected),  its  eigenvalues  A,, ,4 
are  real.  Thus,  the  eigenvalues  of  S  are  real  too.  Also,  since  the  graph  Q  is  a  connected  undirected 
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graph,  the  matrix  A  is  a  real,  nonnegative,  irreducible,  square  matrix.  Under  these  conditions, 
the  Perron-Frobenius  Theorem  [MacCluer,  2000]  says  that  the  largest  eigenvalue  is  a  positive  real 
number  and  also  has  the  largest  magnitude  among  all  eigenvalues.  Thus, 


Ai,s  —  >  lA^sl  Vi 


(3.14) 


Using  this  in  Equation  3.13: 


Al,5  <  1 

that  is,  1  —  5  +  /3Ai,a  <  1 


which  means  that  an  epidemic  is  prevented  if  /3/S  <  1/Ai,a-  Also,  if  P/6  >  1/Ai :a,  the  probabili¬ 
ties  of  infection  may  diverge  from  zero,  and  an  epidemic  could  occur.  Thus,  the  epidemic  threshold 


is 


t~nlds  — 


Ai, 


A 


□ 


Lemma  2  (Eigenvalues  of  the  system  matrix).  The  i  —  th  eigenvalue  of  S  is  of  the  form  Ahs  = 
1  —  5  +  (3  A itA,  and  the  eigenvectors  of  S  are  the  same  as  those  of  A. 


Proof  Let  ui  A  be  the  eigenvector  of  A  corresponding  to  eigenvalue  A Then,  by  definition, 
AuiiA  =  A iiA  ■  Ui,A  Now, 

Slli  A  =  (1  -  5)uijA  +  /3AUi,A 
=  (1  -  5)ui)A  +  /3Aj,AUi,A 

=  (1  -  5  +  /TV,A)ui,A  (3.15) 

Thus,  U;  a  is  also  an  eigenvector  of  S,  and  the  corresponding  eigenvalue  is  (1  —  5  +  /3A jiA). 

Conversely,  suppose  A its  is  an  eigenvalue  of  S  and  ui  S  is  the  corresponding  eigenvector.  Then, 


Aj,5Ui,S 


A  i,s  +  5  —  1 

P 


Ui,s 


Sui,s 

(1  -  5)uiiS  +  (?AuijS 


Auis 


Thus,  uits  is  also  an  eigenvector  of  A,  and  the  corresponding  eigenvalue  of  A  is  AhA  =  (A its  + 

5-1)//?.  ’  ’  □ 

Theorem  3  (Part  B:  Sufficiency  of  Epidemic  Threshold).  If  §  <  t~nlds  =  — >  then  the 

0  M,a 

epidemic  will  die  out  over  time  (the  infection  probabilities  will  go  to  zero),  irrespective  of  the  size 
of  the  initial  outbreak  of  infection. 


Proof 

(i(t)  =  JJ  (1  -  p  ■  pj(t  -  1))  (from  Eq.  3.5) 

j -.neighbor  of  i 

>  1  -P-  Y,  (3.1-6) 

j -.neighbor  of  i 
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where  the  last  step  follows  because  all  terms  are  positive. 
Now,  for  %  —  1 ...  N, 

i  -Pi(t)  =  (i  -Pi(t  - 1))  •  Ci(t)  +  8  i)  •  (i(t) 
=  (l  -  (l  -5)  -Pi(t- 1)) 


>  (l-(l-S)-pi(t-l))-  1-/3 


Y 

j -.neighbor  of  i  / 


(using  Eq.  3.16) 

>  (1  -  (1  -  5)  •  pi(t  -  1))  •  (  1  -  (3  Y A jti  ■  pj(t  -  1) 


N 


3= 1 


(because  AJ  ?;  =  1  for  neighbors  only) 

>  1  -  (1  -  5)  -pi(t  -  1) 

-(3  Y  AU  ■  Pjh  -  3)  +  (3  ■  (1  -  5)  ■  Pi(t  -  1)  Y  Aj,i  •  Pj(t  -  x) 


> 


1  -  (1  -  <5)  •  pi(t  -  1)  -  f3  Y  Ai,i '  Pj(t  ~  !) 


(3.17) 


No  assumptions  were  required  in  this. 

Thus, 

Pi(t)  <  (l-6)-pi{t-l)  +  (3YA3,i-P3(t-'1) 


(3.18) 


Writing  this  in  vector  form,  we  observe  that  this  uses  the  same  system  matrix  S  from  Equation  3.11: 

p(t)  <  S p{t  —  1)  (using  the  definition  of  S  from  Eq.  3.11) 

<  S 2p(t  —  2)  <  . . . 

<  sV(o) 

<  Y  'W  ^ s'  7^(°)  (3.19) 

i 

where  the  last  step  is  the  spectral  decomposition  of  St.  Using  Eq.  3.12, 

K,s  —  1  —  <5  +  (3Xi,A 
,5 


<  1  —  5  + [3—  (the  sufficiency  condition) 


(3 


<  1 


and  so,  A* , 


0  for  all  i  and  large  t 


Thus,  the  right-hand  side  of  Eq.  3.19  goes  to  zero,  implying  that 

p(t)  ^  0  as  t  increases 
implying  that  the  infection  dies  out  over  time. 


□ 
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Theorem  4  (Exponential  Decay).  When  an  epidemic  is  diminishing  therefore  [3/5  <  fo — ),  the 


probability  of  infection  decays  at  least  exponentially  over  time. 


Proof  We  have: 


p{t)  <  Y  A j'S*  Uj,s  Ui,s  m  (from  Eq  3.19) 


i 


<  xljSt  *  c 


(3.20) 


where  C  is  a  constant  vector  which  depends  on  the  entire  spectrum  of  eigenvalues  and  eigenvectors. 
Since  the  value  of  A1;s  is  less  than  1  (because  the  epidemic  is  diminishing),  the  values  of  p,  ( t )  are 
decreasing  exponentially  over  time.  The  exact  rate,  however,  depends  on  the  entire  spectrum  of 


eigenvalues. 


n 


Corollary  1.  NLDS  subsumes  the  KW  model  for  homogeneous  or  random  Erdos-Renyi  graphs. 

Proof.  According  to  the  KW  model  3.1,  the  epidemic  threshold  in  a  random  Erdos-Renyi  graph  is 
tKw  =  1  /(k),  where  ( k )  is  the  average  degree  [  iephart  and  White,  ].  It  is  easily  shown  that, 
in  a  homogeneous  or  random  network,  the  largest  eigenvalue  of  the  adjacency  matrix  is  ( k ) .  There¬ 
fore,  our  model  yields  the  same  threshold  condition  for  random  graphs,  and  thus,  our  NLDS  model 
subsumes  the  KW  model.  □ 

Corollary  2.  The  epidemic  threshold  tnlds  for  a  star  topology,  is  exactly  foj,  where  s/d  is  the 
square  root  of  the  degree  of  the  central  node. 

Proof.  The  eigenvalue  of  the  adjacency  matrix,  Ai,  is  simply  s/d.  Thus,  the  epidemic  threshold  is 
tnlds  =  Y  □ 

Corollary  3.  The  epidemic  threshold  for  an  infinite  power-law  network  is  zero.  This  matches 
results 

Proof.  In  a  power-law  network,  the  first  eigenvalue  of  the  adjacency  matrix  is  A  1,4  =  s/dmax , 
where  dmax  is  the  maximum  node  degree  in  the  graph  [Mihail  and  Papadimitriou,  2002].  Since 
dmax  OC  ln(N)  and  N  is  infinite,  Ah/\  is  infinite.  Our  epidemic  threshold  condition  states  that 
tnlds  =  l/Ai,^.  Therefore,  the  epidemic  threshold  is  effectively  zero  for  infinite  power-law 
networks.  This  result  concurs  with  previous  work,  which  found  that  infinite  power-law  networks 
lack  epidemic  thresholds  [  gnani,  2  ].  □ 

Corollary  4.  Below  the  epidemic  threshold  ( score  s  <  1),  the  expected  number  of  infected  nodes 
T)t  at  time  t  decays  exponentially  over  time. 


Proof. 


N 


vt  = 


=  Y;  Ai,gf  *  Cj 
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(from  Theorem  4) 


where  Ci  are  the  individual  elements  of  the  matrix  C  in  Equation  3.20.  Since  JT  Ci  is  a  constant 
and  Ai5s  <  1  (from  Theorem  1),  we  see  that  nt  decays  exponentially  with  time.  □ 


3.6  Summary 

Our  work  on  viral  propagation  has  two  main  contributions: 

1.  Given  a  graph,  we  found  an  epidemic  threshold  condition  below  which  an  infection  dies  out, 
but  above  which  it  may  survive  in  the  graph  for  a  long  time.  Surprisingly,  this  epidemic 
threshold  is  found  to  depend  on  only  one  property  of  the  graph:  its  largest  eigenvalue  Ai. 
Specifically,  the  infection  dies  out  when  (5/5  <  1/ Ai .  Thus,  higher  the  value  of  Ai ,  higher  the 
susceptibility  of  the  graph  to  viral  outbreaks.  This  result  is  an  important  design  consideration 
not  only  in  the  development  of  virus-resistant  computer  networks,  but  also  in  understanding 
the  spread  of  rumors,  or  fads,  or  the  effectiveness  of  information  dissemination  programs. 

2.  In  addition,  we  proposed  a  very  general  dynamical-systems  framework,  which  can  now  be 
used  in  other  similar  cases  where  propagation  depends  on  network  topology.  In  fact,  in 
Chapter  4,  we  will  demonstrate  the  applicability  of  this  methodology  in  calculating  the  “sur¬ 
vivability”  of  information  in  sensor  networks. 

An  important  avenue  of  future  work  is  using  this  threshold  condition  in  developing  good  im¬ 
munization  schemes,  to  maximally  disrupt  the  spread  of  a  virus. 
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Chapter  4 

Information  survival  in  sensor  and  P2P 
networks 


“Consider  a  piece  of  information  being  spread  within  a  sensor  or 
P2P  network  with  failing  links  and  nodes.  What  conditions  on  net¬ 
work  properties  determine  if  the  information  will  survive  in  the  net¬ 
work  for  long,  or  die  out  quickly? 


Sensor  and  Peer-to-peer  (P2P)  networks  have  recently  been  employed  in  a  wide  range  of  applica¬ 
tions: 


•  Oceanography:  Video  sensors  have  been  used  to  generate  panoramic  views  of  the  near-shore 
ocean  surface  [  lolman  et  ah,  2003]. 

•  Infrastructure  monitoring:  Software  sensors  have  been  used  to  monitor  and  collect  statistics 
on  various  bottleneck  points  in  computers,  such  as  CPU  load,  network  bandwidth,  and  so  on. 

•  Parking  Space  tracking:  This,  and  several  similar  applications,  have  been  developed  and 
deployed  under  the  IrisNet  framework  [  Tibbons  et  al ,  2002]. 

A  lot  of  work  has  also  been  done  recently  on  collating  information  from  many  different  sensors  and 
answering  user  queries  efficiently  [Madden  et  al.,  2003,  Garofalakis  and  Kumar,  2004,  Considine 
et  al.,  2004,  Nath  et  al ,  2004], 

We  look  at  the  problem  of  survivability  of  information  in  a  sensor  or  P2P  network  under  node 
and  link  failures.  For  example,  consider  a  sensor  network  where  the  communication  between 
nodes  is  subject  to  loss  (link  failures),  and  sensors  may  fail  (node  failures).  In  such  networks, 
we  may  want  to  maintain  some  static  piece  of  information,  or  “datum”,  which,  for  the  sake  of 
exposition,  we  refer  to  as  “Smith’s  salary”.  If  only  one  sensor  node  keeps  Smith’s  salary,  that  node 
will  probably  fail  sometime  and  the  information  will  be  lost.  To  counter  this,  nodes  that  have  the 
datum  can  broadcast  it  to  other  nodes,  spreading  it  through  the  network  and  increasing  its  chances 
of  survival  in  the  event  of  node  or  link  failures. 

Informally,  the  problem  we  solve  is: 
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PROBLEM:  Under  what  conditions  can  we  expect  Smith ’s  salary  to  sun’ive  in  the 
sensor  network? 

This  question  is  similar  to  the  one  we  asked  for  viral  propagation,  and  in  order  to  answer  it,  we 
must  surmount  the  same  issues.  Although  the  problem  definition  is  deceptively  simple,  there  is  no 
known  practical,  exact  solution.  The  obvious  one,  discussed  later  in  Section  4.2,  requires  Markov 
Chains,  and  is  prohibitively  expensive:  its  cost  is  proportional  to  3^,  where  N  is  the  number  of 
nodes.  For  a  network  of  N  ps  200  nodes,  the  effort  is  comparable  to  the  number  of  electrons  in  the 
universe. 

As  in  the  analysis  of  viral  propagation,  we  will  use  the  dynamical  systems  framework  to  solve 
the  current  problem.  We  will  find  a  threshold  below  which  the  information  is  expected  to  die  out 
quickly,  but  above  which  the  information  may  survive.  This  threshold,  once  again,  depends  only 
on  the  largest  eigenvalue  of  a  appropriately  constructed  square  matrix. 

The  layout  of  this  chapter  is  as  follows:  Section  4.1  discusses  some  background  work  related  to 
the  problem.  In  Section  4.2  gives  a  precise  problem  definition,  and  our  results  on  the  information 
survival  threshold  are  described  in  Section  4.3.  Section  4.4  details  experiments  showing  the  accu¬ 
racy  of  our  predictions.  We  provide  details  of  the  proofs  in  Section  4.5,  followed  by  a  summary  in 
Section  4.6. 


4.1  Related  Work 

Much  of  the  applicable  related  work  has  already  been  described  in  Section  3.1.  Here,  after  a  brief 
recap  of  those  ideas,  we  will  focus  on  prior  work  in  sensor  networking. 


4.1.1  Recap:  Propagation  over  a  network 

The  key  concepts  were: 

•  The  SIS  and  SIR  models:  These  describe  the  behavior  of  a  single  node  during  viral  prop¬ 
agation.  A  susceptible  (S)  node  can  be  attacked  and  infected  by  an  infected  neighbor.  An 
infected  (I)  node  can  be  cured  locally,  in  which  case  it  can  either  become  susceptible  again 
(in  the  SIS  model),  or  can  be  removed  (R)  and  achieve  immunity  (in  the  SIR  model).  The 
spread  of  information  in  a  sensor  network  is  similar  to  the  SIS  model:  a  node  can  get  the 
information  from  a  neighbor,  but  it  can  fail  and  lose  information.  When  it  is  replaced,  it 
becomes  “susceptible”  again. 

•  The  epidemic  threshold:  For  the  SIS  model,  we  found  a  closed-form  formula  for  the  epi¬ 
demic  threshold,  below  which  a  viral  outbreak  is  expected  to  die  out  exponentially  quickly, 
but  above  which  it  may  survive  for  a  long  time  (see  Chapter  3).  This  formula  is  surpris¬ 
ingly  simple:  it  depends  only  on  the  “birth”  and  “death”  rates  of  the  virus,  and  the  largest 
eigenvalue  of  the  network  topology. 
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4.1.2  Sensor  Networks  and  Gossip-based  Protocols 

Graphs  and  sensor  networks  have  attracted  a  lot  of  interest  lately,  for  quick  and  efficient  aggre¬ 
gation  of  information  [Garofalakis  and  Kumar,  2004,  Considine  et  al.,  2004],  for  understanding 
“trust”  and  “distrust”  in  online  social  networks  [  aha  et  al., .  ],  and  in  several  other  areas. 

Large  sensor  and  P2P  networks  typically  have  dynamic  node  populations  with  a  high  rate  of 
node  turnover  (called  high  churn  [Rhea  et  al.,  2004]),  and  “gossip”  protocols  have  recently  been 
used  to  handle  such  situations.  These  protocols  spread  information  using  a  (possibly  random)  net¬ 
work  of  connections  between  nodes,  as  in  our  case,  and  have  proved  useful  in  reliable  multicast  and 
broadcast  [Levis  et  al.,  2004,  Gupta  et  al.,  2002,  Luo  et  al.,  2003,  Birman  et  al.,  1999],  resource  lo¬ 
cation  [Kempe  et  al.,  2001,  Renesse,  2000],  failure  detection  [Wang  and  Kuo,  2003,  Renesse  et  al., 
1998,  Sistla  et  al.,  2001,  Bums  et  al.,  1999,  Zhuang  et  al.,  2005],  database  aggregation  [Kempe 
et  al.,  ],  database  and  peer-to-peer  replication  [Demers  et  al ,  198"/],  and  ensuring  the  stability 
of  dynamic  hash  table -based  peer-to-peer  systems  [  pta  et  a’  ,  3,  Rhea  et  al.,  ].  Several 

empirical  and  theoretical  studies  of  the  rates  of  propagation  and  convergence  of  gossip  protocols 
have  also  been  made  [Boyd  et  al.,  Ganesan  et  al.,  2002,  Levis  et  al.,  2004,  Kempe  et  al.,  2001, 
Kempe  and  Kleinberg,  2002]. 

However,  they  all  assume  that  the  initial  infection  or  rebroadcast  rate  is  high  enough  that  dying 
out  is  not  a  concern.  With  the  rise  of  sensor  and  peer  to  peer  networks  characterized  by  high  chum, 
theory  that  describes  the  survivability  of  data  in  such  networks  is  increasingly  important,  and  that 
is  the  central  problem  we  address. 


4.2  Model  for  Information  Propagation 

As  in  Chapter  3,  we  have  a  sensor/P2P/social  network  of  N  nodes  (sensors  or  computers  or  people) 
and  E  directed  links  between  them.  Our  analysis  assumes  very  small  discrete  timesteps  of  size  At, 
where  At  — >  0.  Within  a  At  time  interval,  each  node  i  has  a  probability  r,  of  trying  to  broadcast 
its  information  every  timestep,  and  each  link  i  —>  j  has  a  probability  f3i3  of  being  “up”,  and  thus 
correctly  propagating  the  information  to  node  j.  Each  node  i  also  has  a  node  failure  probability 
Si  >  0  (e.g.,  due  to  battery  failure  in  sensors).  Every  dead  node  j  has  a  rate  jj  of  returning  to 
the  “up”  state,  but  without  any  information  in  its  memory  (e.g.,  due  to  the  periodic  replacement  of 
dead  batteries).  These  and  other  symbols  are  listed  in  Table  4.1. 

The  system  is  a  Markov  chain  with  3;V  states;  each  state  corresponds  to  one  possible  network 
configuration,  with  each  of  the  N  nodes  being  in  one  of  three  states:  “Has  Info,”  “No  Info,”  or 
“Dead.”  There  is  a  set  of  “absorbing”  states  (where  no  node  “Has  Info”),  which  can  be  reached 
from  any  starting  state.  Thus,  the  information  will  die  out  with  probability  1.  However,  from  a 
practical  point  of  view,  in  some  parameter  combinations  this  extinction  happens  quickly,  while  in 
others  it  is  expected  to  be  longer  than  the  age  of  the  universe  (for  large  graphs)  -  in  the  latter  cases, 
the  datum  practically  ‘survives’. 

Definition  13  (Fast  Extinction).  ‘  ‘Fast  extinction  ”  is  the  setting  where  the  number  C (t)  of  “car¬ 
riers”  (i.e.,  nodes  in  “Has  Info”  state)  decays  exponentially  over  time  ( C(t )  oc  c_t,  c  >  1). 

Now,  we  formally  state  our  problem: 
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Symbol 

Description 

N 

Number  of  nodes  in  the  network 

fiij 

Link  quality:  Probability  that  link  i  — >  j  is  up 

Si 

Death  rate:  Probability  that  node  %  dies 

7 i 

Resurrection  rate:  Probability  that  node  i  comes  back  up 

Ti 

Retransmission  rate:  Probability  that  node  i  broadcasts 

Pif) 

Probability  that  node  i  is  alive  at  time  t  and  has  info 

qft) 

Probability  that  node  i  is  alive  at  time  t  but  without  info 

i 

C-K 

1 

Probability  that  node  i  is  dead 

(ft) 

Probability  that  node  i  does  not  receive  info  from  any 
of  its  neighbors  at  time  t 

P(t),  q(t) 

Probability  column  vectors: 

Pf)  =  {pft),p2(t),...,pN(t))', 
q(t)  =  ( qi{t),q2{t),...,qN{t ))' 

f:  Tf2N  -*•  Tf2N 

Function  representing  a  dynamical  system 

V(f) 

The  Jacobian  matrix  of  f(.) 

s 

The  N  x  N  “system”  matrix 

As 

An  eigenvalue  of  the  S  matrix 

A1,S 

The  largest  eigenvalue  (in  magnitude)  of  S 

s  =  lAi,sl 

“Survivability  score”  =  Magnitude  of  Xl  g 

Table  4.1:  Table  of  symbols 


PROBLEM: 

•  Given:  the  network  topology  ( link  “up”  probabilities )  3l]}  the  retransmission 
rates  ri,  the  resurrection  rates  yt  and  the  death  rates  5,  (i  —  1 ...  N,  j  — 
1...N) 

•  Find  the  condition  under  which  a  datum  will  suffer  “fast  extinction.” 

To  simplify  the  problem  and  to  avoid  dependencies  on  starting  conditions,  we  consider  the  case 
where  all  nodes  are  initially  in  the  “Has  Info”  state. 

As  in  analysis  of  viral  propagation  (Chapter  3),  we  will  convert  this  into  a  dynamical  system, 
and  answer  questions  in  that  system.  Let  the  probability  of  node  i  being  in  the  “Has  Info”  and  “No 
Info”  states  at  time-step  t  be  pft)  and  qft)  respectively.  Thus,  the  probability  of  its  being  dead 
is  (1  —  pft)  —  qft)).  Starting  from  state  “No  Info”  at  time-step  t  —  1,  node  i  can  acquire  this 
information  (and  move  to  state  “Has  Info”)  if  it  receives  a  communication  from  some  other  node  j. 
Let  (ft)  be  the  probability  that  node  i  does  not  receive  the  information  from  any  of  its  neighbors. 
Then,  assuming  the  neighbors’  states  are  independent: 

C,(«)  =  nf=1(l-rAft(«-l))  (4.1) 

For  each  node  i,  we  can  use  the  transition  matrix  in  Figure  4.1  to  write  down  the  probabilities 
of  being  in  each  state  at  time  t,  given  the  probabilities  at  time  t  —  1  (recall  that  we  use  very  small 


30 


Figure  4.1:  Transitions  for  each  node:  This  shows  the  three  states  for  each  node,  and  the  probabil¬ 
ities  of  transitions  between  states. 


timesteps  At,  and  so  we  can  neglect  second-order  terms).  Thus: 


Pi(t) 

=  pft  -  1)  (1  -  Si) 

+qi(t-l)(l-Ci(t)) 

(4.2) 

Qi{t) 

=  Qi(t  ~  1)  (Ci(t)  ~  Si) 

+  (1  -  Pi(t  -  1)  -  qi(t  ~  1))  7 i 

(4.3) 

From  now  on,  we  will  only  work  on  this  dynamical  system.  Specifically,  we  want  to  find  the 
condition  for  fast  extinction  under  this  system. 


4.3  Information  Survival  Threshold 

Define  S  to  be  the  N  x  N  system  matrix: 

J  1  —  Si  if  i  —  j 

~  |  rjfdji  otherwise  (4 -4-) 

Let  |  A1  g  |  be  the  magnitude  of  the  largest  eigenvalue  (in  magnitude).  Let  C(t)  to  be  the  expected 
number  of  carriers  at  time  t  according  to  this  dynamical  system;  C(t )  =  YliLi  Pi{i)- 

Theorem  5  (Condition  for  fast  extinction).  Define  s  =  |A1  g|  to  he  the  “survivability  score”  for 
the  system.  If 

s  =  I'Vsl  < 1 

then  we  have  fast  extinction  in  the  dynamical  system,  that  is,  C  (t)  decays  exponentially  quickly 
over  time. 

Proof.  The  theorem  is  proved  later  in  Section  4.5.  □ 

Definition  14  (Threshold).  We  will  use  the  term  “below  threshold”  when  s  <  1,  “above  thresh¬ 
old”  when  s  >  1,  and  “at  the  threshold”  for  s  —  1. 
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Corollary  5  (Homogeneous  case).  If  6,  —  5,  r,;  =  r,  yt  =  7 for  all  i,  and  B  =  \fffi  is  a  symmetric 
binary  matrix  (links  are  undirected,  and  are  always  up  or  always  down),  then  the  condition  for  fast 
extinction  is: 


5(7+5) A i.B  <  1 


One  way  to  visualize  this  is  to  think  of  the  spread  of  information  as  a  random  walk  on  the  sensor 
network.  The  information  “traverses  one  hop”  according  to  the  probabilities  from  the  B  matrix, 
and  thus  “traverses  n  hops”  according  to  B".  But  B"  grows  as  A™g  (indeed,  this  is  the  basis  of  the 
“power-method”  for  finding  the  largest  eigenvalue  of  a  matrix).  This  growth  is  increased  by  higher 
retransmission  rate  r,  but  the  information  can  die  out  with  probability  5,  which  reduces  the  spread. 
Also,  a  “target”  node  is  not  “Dead”  only  — ^  fraction  of  the  time,  so  the  information  can  spread 

only  so  fast.  Thus,  the  survivability  score  B  rcPrcscnts  the  rate  of  spread,  taking  all  these 

effects  into  account.  If  this  is  less  than  one,  the  information  is  not  spreading  quickly  enough,  and 
becomes  extinct  quickly. 


Theorem  5  has  several  corollaries  and  special  cases,  which  are  of  considerable  interest.  For 
clarity,  we  only  consider  homogeneous  undirected  graphs  here  (as  in  Corollary  5),  and  proofs  can 
be  obtained  by  application  of  the  Theorem. 


Corollary  6.  Consider  two  scenarios  on  the  same  network  with  all  parameters  the  same  except 
for  varying  rates  7,  and  7  2for  dead  nodes  coming  “up”  again.  Suppose  7,  >  y2.  Then,  the  first 
system  has  higher  survivability  score. 


Proof.  Si  =  — — 


5(fA5)Al-B  -  Ai-B  >  Ai.B  -  tAb  - 

Corollary  7.  Our  model  subsumes  the  SIS  model  of  viral  infection  as  a  special  case. 


□ 


Proof.  The  SIS  model  consists  has  only  two  states  per  node:  “Has  Infection”  and  “No  Infection.” 
In  our  model,  we  can  increase  7  so  that  a  “dead”  node  comes  back  “up”  very  quickly,  giving  the 
appearance  of  only  two  states:  “Has  Info”  and  “No  Info”.  In  this  situation,  7  is  large  (7  ~  1) 
but  Si  is  small  (due  to  small  timesteps).  From  Corollary  5,  the  fast-extinction  condition  becomes 
5A  g  <  1  (recall  that  the  matrix  B  =  [/%]).  This  is  exactly  the  epidemic  threshold  condition  for 
the  SIS  model  [Wang  et  al.,  2003,  Ganesh  et  al.,  2005].  □ 

Corollary  8.  If  the  “dead”  nodes  are  never  replaced  (7  =  0 ),  or  if  the  nodes  do  not  transmit  at  all 
(r  =  0),  we  always  have  fast  extinction. 


Proof,  s  =  |Al  S 


7  r  \ 
5(7+5) 


0  for  7  =  0  or  r  =  0.  Hence,  we  always  have  fast  extinction. 

□ 


Corollary  9.  Consider  two  networks  with  link  “up” probabilities  given  by  matrices  Bi  and  B2.  If 
|  Ax  bJ  >  A !  B  |,  then  the  first  network  has  higher  survivability  score. 

Fro°f-  s ■  =  j^jA  b,  >  Ji+^A.B,  =  □ 
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Corollary  10  (P2P  resilience).  A  ‘  ‘star”  network  has  higher  sun’ivability  score  than  a  “ring” 
network  with  the  same  number  of  nodes. 

Proof.  |A1  p>  | st„r  =  s/N  —  1  >  2  =  |A,  g| ring.  From  Corollary  9,  the  “star”  network  has  higher 
survivability  score.  □ 


4.4  Experiments 

We  want  to  answer  the  following  questions: 

(QD  How  close  is  our  approximation  technique  to  the  “truth”,  for  both  real  and  synthetic  graphs? 

(Q2)  How  accurate  is  our  threshold  condition  (Theorem  5)? 

The  datasets  used  were: 

•  GRID :  This  is  a  large  synthetic  2D  grid.  Link  “up”  probabilities  (/ 3^- )  are  fixed  at  0. 1  between 
all  neighbors  on  the  grid. 

•  GNUTELLA :  This  is  a  snapshot  of  the  Gnutella  peer-to-peer  file  sharing  network,  collected 
in  March  2001  [Ripeanu  et  ah,  2002].  Link  “up”  probabilities  are  set  to  0.1  for  existing 
edges. 

•  INTEL :  This  is  a  54-node  sensor  network  observed  over  a  period  of  33  days  [  nte  ].  Link 
probabilities  were  derived  from  the  collected  data.  The  nodes  were  deployed  in  a  lab  with 
a  rectangular  shape  and  “soft”  walls  which  can  be  penetrated  by  radio  signals  (see  Fig¬ 
ure  4.3(a)),  leading  to  high  average  node  degree  (~  46)  in  the  network.  The  link  qualities 
/3ij  are  smeared-out  over  the  entire  range,  as  shown  in  Figure  4.2(a).  The  average  link  quality 
(considering  only  the  links  with  non-zero  link  quality)  is  very  low  (0.14). 

•  MIT:  This  is  a  40-node  sensor  network  at  MIT  (see  [  nil  et  al.,  ]  for  an  earlier  version  of 
the  network).  Figure  4.3(b)  shows  the  placement  of  sensors.  The  central  region  blocks  radio 
signals,  creating  an  elongated  “corridor”  of  sensor  communications  around  it.  This  implies 
low  average  node  degree  (~  18);  however,  the  link  quality  distribution  is  very  peaked,  as 
shown  in  Figure  4.2(b).  This  leads  to  a  high  value  of  0.92  for  the  average  link  quality 
(again  only  considering  the  non-zero  quality  links).  Note  that  these  conditions  are  the  exact 
opposite  of  what  we  see  for  the  INTEL  dataset. 


4.4.1  (Ql)  Accuracy  of  the  dynamical  system 

For  each  of  the  datasets,  the  parameters  were  set  so  that  the  system  was  below,  above  and  at  the 
threshold  (that  is,  the  survivability  score  s  was  less  then,  equal  to  or  greater  than  1)  according  to 
Theorem  5.  Table  4.2  shows  these  parameter  values.  All  nodes  initially  carry  the  information.  The 
simulation  is  then  run  according  to  the  state  diagram  shown  in  Figure  4.1  for  10,  000  steps,  and 


33 


Link  quality 


(a)  INTEL  link  qualities 


1000 


Link  quality 


(a)  MIT  link  qualities 


Figure  4.2:  Link  quality  distributions:  Plots  (a)  and  (b)  plots  the  number  of  links  versus  link 
quality.  Pairs  of  sensors  which  cannot  communicate  with  each  other  have  a  link  quality  of  0. 
While  the  INTEL  distribution  shows  a  broad  range  of  link  qualities,  the  MIT  distribution  is  very 
highly  peaked. 


(a)  INTEL  sensor  map 


(b)  MIT  sensor  map 


Figure  4.3:  Sensor  placement  maps:  Black  numbered  dots  for  INTEL ,  and  black  dots  for  MIT. 


the  number  of  carriers  (nodes  with  information)  over  the  epochs  of  simulation  was  recorded.  Each 
simulation  was  repeated  100  times. 

Figure  4.4  shows  the  number  of  carriers  over  time  (only  200  simulation  epochs  are  shown  for 
ease  of  visualization;  the  results  are  similar  over  10,  000  timesteps).  Each  dataset  is  investigated 
under  three  scenarios:  above,  at  and  below  our  threshold  condition.  For  each  scenario,  we  show  the 
simulation  result  in  solid  lines,  along  with  its  confidence  interval  (one  standard  deviation).  While 
the  confidence  intervals  are  very  small  for  GRID  and  GNUTELLA ,  the  INTEL  and  MIT  datasets 
show  wider  confidence  intervals:  this  is  due  to  their  small  sizes.  In  addition  to  the  simulations,  we 
ran  our  dynamical  system  (Equations  4. 2-4. 3)  using  exactly  the  same  parameters;  the  number  of 
carriers  in  this  case  is  shown  in  dotted  lines.  We  make  the  following  observations: 

•  The  dynamical  system  is  very  accurate:  The  dotted  lines  of  our  dynamical  system  are  almost 
indistinguishable  from  the  solid  lines  of  the  simulation  (relative  error  is  just  around  1%). 
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—  Simulation 

Dynamical  system 


above  threshold 
at  threshold 


50  100  150  200 

Simulation  epochs 


(a)  GRID 


(c)  INTEL 


x  10 


(b)  GNUTELLA 


(d)  MIT 


Figure  4.4:  Number  of  carriers  versus  time  (simulation  epochs ):  Each  plot  shows  the  evolution 
of  the  dynamical  system  (dotted  lines)  and  the  simulation  (solid  lines).  Confidence  intervals  are 
shown  for  the  simulation  results.  Three  cases  are  shown  for  each  dataset:  below  threshold,  at  the 
threshold,  and  above  the  threshold,  with  parameter  settings  as  shown  in  Table  4.2.  There  are  two 
observations:  (1)  The  dynamical  system  (dotted  lines)  is  very  close  to  the  simulations  (solid  lines), 
demonstrating  the  accuracy  of  Equations  4. 2-4. 3.  (2)  Also,  the  number  of  carriers  dies  out  very 
quickly  below  the  threshold,  while  the  information  “survives”  above  the  threshold. 
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Dataset 

w.r.t. 

threshold 

5 

7 

r 

Survivability 

s 

below 

0.1 

0.01 

0.1 

0.90 

GRID 

at 

0.01 

0.004 

0.1 

1.001 

above 

0.01 

0.1 

0.1 

1.02 

below 

0.1 

0.01 

0.1 

0.91 

GNUTELLA 

at 

0.07 

0.004 

0.1 

1.003 

above 

0.01 

0.01 

0.1 

1.05 

below 

0.1 

0.01 

0.1 

0.96 

INTEL 

at 

0.02 

0.0006 

0.1 

1.0003 

above 

0.01 

0.01 

0.1 

1.33 

below 

0.15 

0.01 

0.1 

0.96 

MIT 

at 

0.05 

0.0006 

0.1 

1.01 

above 

0.01 

0.01 

0.1 

1.88 

Table  4.2:  Parameter  settings  for  the  datasets. 


(a)  GRID  (b)  GNUTELLA  (c)  INTEL  (d )  MIT 

Figure  4.5:  Number  of  carriers  after  a  “long”  time,  versus  the  retransmission  probability:  The 
dashed  vertical  line  shows  our  threshold  (s  =  1).  The  information  dies  out  below  our  threshold, 
but  survives  above  it. 


Thus,  equations  4. 2-4. 3  are  highly  accurate  for  a  wide  variety  of  real-world  scenarios. 

•  The  information  dies  out  below  our  threshold:  For  all  the  datasets,  the  number  of  carriers 
goes  to  zero  very  quickly  below  the  threshold.  Thus,  the  information  does  not  survive. 

•  Above  our  threshold,  the  number  of  carriers  practically  stabilizes:  In  other  words,  the  infor¬ 
mation  survives  for  a  “long”  time. 


4.4.2  (Q2)  Accuracy  of  the  threshold  condition 

In  this  set  of  experiments,  we  modify  one  parameter  while  keeping  all  the  others  fixed.  The  link 
qualities  depend  on  the  environment,  while  the  death  rate  5  is  intrinsic  to  the  sensor  and 
its  battery;  thus,  we  only  perform  experiments  that  modify  the  retransmission  rate  r  and  the 
resurrection  rate  7.  For  each  dataset,  we  run  simulations  for  many  values  of  r  and  7,  and  count 
the  number  of  carriers  left  after  a  “long”  time  (1,000  simulation  epochs).  For  each  setting,  100 
simulation  runs  are  performed  to  provide  confidence  intervals. 
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(a)  GRID  (b)  GNUTELLA  (c)  INTEL  (d )  MIT 

Figure  4.6:  Number  of  carriers  after  a  “long”  time,  versus  the  resurrection  probability:  The 
dashed  vertical  line  shows  our  threshold.  Again,  our  threshold  is  very  accurate. 


Varying  retransmission  rate  r 

For  the  purpose  of  this  experiment,  the  death  rate  8  and  the  resurrection  rate  7  were  set  to  0.01. 
Figure  4.5  shows  the  number  of  carriers  left  after  a  “long”  time,  versus  different  values  of  the 
retransmission  rate  r,  on  all  four  datasets.  The  dashed  vertical  line  marks  the  value  of  r  which 
leads  to  a  survivability  score  of  s  =  1  according  to  Theorem  5,  that  is,  the  “at-the-threshold” 
scenario.  INTEL  and  MIT  have  higher  variance  (and  hence  wider  confidence  intervals)  due  to 
their  small  sizes.  We  observe  the  following: 

•  Below  our  threshold,  the  information  has  died  out:  The  number  of  carriers  is  either  zero  or 
very  close  to  zero  for  all  the  datasets.  This  is  exactly  in  accordance  with  Theorem  5. 

•  Above  the  threshold,  the  information  survives:  Even  after  a  “long”  time,  there  is  a  significant 
population  of  carriers  in  the  network. 

•  Thus,  the  threshold  condition  is  very  accurate. 


Varying  resurrection  rate  7 

Now,  the  resurrection  rate  7  is  varied,  while  keeping  the  retransmission  rate  r  fixed  at  0.1  and  the 
death  rate  5  at  0.01.  Figure  4.6  shows  the  results.  The  dashed  vertical  line  shows  our  threshold 
value.  Once  again,  we  can  see  that  for  all  the  datasets,  the  information  dies  out  below  the  threshold, 
but  survives  above  the  threshold.  Thus,  our  threshold  is  accurate. 


4.5  Details  of  proofs 

Theorem  2  (Condition  for  fast  extinction).  Define 
s  =  | A !  s  to  be  the  “survivability  score”  for  the  system.  If 

s  =  lAi,sl  <  1 

then  we  have  fast  extinction  in  the  dynamical  system,  that  is,  C  (t)  decays  exponentially  quickly 
over  time. 
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Proof.  The  theorem  follows  from  Lemma  3  and  Theorems  3  and  4  below.  We  will  first  show  that 
the  scenario  with  no  information  survival  (pft)  =  0)  forms  a  fixed  point  of  the  dynamical  system. 
Then,  we  show  that  when  s  <  1,  this  fixed  point  is  asymptotically  stable  under  small  perturbations 
(this  is  how  we  derived  the  condition  in  this  theorem).  Finally,  we  show  that  our  threshold  is 
insensitive  to  the  starting  state:  when  s  <  1,  pft)  — >  0  and  thus  Cfit)  —>  0  exponentially  quickly. 

□ 

Definition  15  (Asymptotic  Stability  of  a  Fixed  Point).  A  fixed  point  Pf  is  “asymptotically  stable  ” 
if,  on  a  slight  perturbation  from  Pf,  the  system  returns  to  Pf  (as  against  moving  away,  or  staying 
in  the  neighborhood  of  Pf  but  not  approaching  it)  [  lirsch  and  Smale,  1974], 

Lemma  3  (Fixed  Point).  The  values  (pft)  =  0,  qft)  =  -  ^  j  for  all  nodes  i,  are  a  fixed  point 
of  Equations  4.2-43.  Proved  by  a  simple  application  of  the  Equations. 

Theorem  3  (Stability  of  the  fixed  point).  The  fixed  point  of  Lemma  3  is  asymptotically  stable  if 
the  system  is  below  the  threshold,  that  is,  s  —  |A1  g|  <  1. 

Proof.  First,  we  define  the  column  vector  p(t )  = 

(pi(t),p2(t), . . .  ,pN(t))' ■  Define  q(t)  similarly.  Let  v(t)  =  (p{t),  q(t))  be  the  concatenation  of 
these  two  vectors,  and  Vf  be  the  vector  v(t)  at  the  fixed  point  defined  in  Lemma  3.  Then,  the 
entire  system  can  be  described  as: 


v(t) 

fi  -  1)) 


f  (v(t  —  1))  where 
pft  -  1)  (1  -  Si) 

+®(t-i)(i-Ci(t)) 

< 

qfit  -  1)  (C i(t)  -  Si) 

,  +(1  1)  -  qft  ~  l))ji 


if  i<N 


if?;  >  N 


(4.5) 


where  pft  —  1)  and  qfit  —  1)  are  the  corresponding  entries  in  v(t  —  1). 

To  check  for  the  asymptotic  stability  of  a  fixed  point,  we  can  use  the  following  condition. 


Lemma  4.  (From  [  lirsch  and  Smale,  1974])  Define  V(f)  (also  called  the  Jacobian  matrix)  to  be 
a  2N  x  2N  matrix  such  that 


[V(f)]„ 


d/i  (g(t  ~  1)) 
dvj(t  —  1) 


(4.6) 


Then,  if  the  largest  eigenvalue  (in  magnitude )  of  V(f)  at  vj-  (written  V(f)|^>  )  is  less  than  1  in 
magnitude,  the  system  is  asymptotically  stable  at  Vf.  Also,  if  f  is  linear  and  the  condition  holds, 
then  the  dynamical  system  will  exponentially  tend  to  the  fixed  point  irrespective  of  initial  state. 


Applying  this  to  our  dynamical  system  (Equation  4.5),  we  get  a  block-matrix  form  for  V(f )  | 


vf 


V(f)l  fif 


S  0 

Si  s2 


(4.7) 
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Here,  all  of  S,  Si  and  S2  are  N  x  N  matrices,  defined  as: 


Sy  = 


^  1 1] 

^2  ij  ~ 


1  -Si 


-7f 

-riP. 


......  7i 

J^7»+^i 


if  i  =  J 

otherwise 

if  *  =  J 

otherwise 


as  in  Eqn.  4.4 


1  7t  ~  <5*  if  i  =  J 

0  otherwise 


(4.8) 

(4.9) 


In  order  to  check  for  asymptotic  stability  (Lemma  4),  we  must  find  the  largest  eigenvalues  (in 
magnitude)  ofV(f)|^  .  Let  if  be  an  eigenvector  (of  2 A'  elements)  of  V(f)|^  ,  with  the  form 


x  = 


X\ 

X2 


where  both  x\  and  x2  are  vectors  of  length  N.  Thus, 


V(f)|j7/f  = 


s 

0 

X\ 

L  s. 

S2  J 

.  *2  _ 

which  implies,  Say 


and,  S1T1  +  S2X2 


AV(f%f 


Xl 

X-2 


AV(f)|t7/* 


(4.10) 

(4.11) 


The  only  possible  solutions  are: 


•  Case  1:  Ay^|  ^  is  an  eigenvalue  of  S:  This  is  implied  by  Equation  4.10. 

•  Case  2:  (1  —  7,  —  is  an  eigenvalue  of  S,  for  all  i:  This  happens  if  x\  =  0,  and  follows 
from  Equation  4.11. 


The  largest  eigenvalue  from  Case  2  is  always  less  than  1  in  magnitude  (since  5i  >  0).  The  largest 
eigenvalue  (in  magnitude)  from  Case  1  is  exactly  A:  g.  Thus,  if  I A1  g  |  <  1,  then  from  Lemma  4, 
the  system  is  asymptotically  stable.  □ 

Theorem  4  (Insensitivity  to  the  starting  state).  If  |A,  g  <  1,  then  pft)  — >  0  exponentially 
quickly  for  all  nodes  i,  irrespective  of  the  starting  state. 

Proof  We  will  prove  this  in  three  parts: 

1.  pft)  +  qft)  — >  "C  exponentially  quickly  for  all  i.  Since  both  pft)  and  qft)  are  non- 

yj+Oj 

negative  (they  are  probabilities),  this  implies  that  either  (1)  qft)  —>  and  thus  pft)  —> 
0,  OR  (2)  qft)  <  — ^V.  One  of  these  two  situations  occurs  exponentially  quickly. 

rYiJpVi 
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2.  If  qfT)  <  at  any  time  t  —  T,  then  qft)  <  3^  for  all  t  >  T. 

3.  Under  the  condition  that  <&(£)  <  and  |A,g|  <  1,  the  dynamical  system  of  Equa- 

y'j+Oj  i 

tions  4. 2-4. 3  tends  to  the  fixed  point  (pft)  =  0,  qft)  =  exponentially  quickly. 

Thus,  when  \X1  g|  <  1,  we  always  end  up  with  pft)  — >  0  for  all  i,  and  this  happens  exponentially 

quickly.  So,  the  expected  number  of  nodes  with  information  (7(f)  =  YliLi  Pitt)  0  exponentially 
quickly.  All  these  three  parts  are  proven  below.  □ 

Lemma  5.  pft)  +  q,  ( t )  — >  exponentially  quickly  for  all  i. 

'Yi+Oi 

Proof.  From  Equations  4.2  and  4.3,  we  can  get: 


1  -  Pitt)  -  Qitt)  =  Pitt  -  +  Qitt~ 

+  (l-Pi(f-l)-gi(f-l))(l«7i)  (4-12) 

Let  Xi(t)  =  Pitt)  +  Qitt)-  Then,  from  Equation  4.12, 


1  -  Xi(t)  =  xft  -  l)5i 

+(1  -  xft  -  1))(1  -  7i) 

and  so,  xft)  =  7*  +  xft  ~  1)(1  -  7*  ~  $i)  (4T3) 


Equation  4.13  is  a  linear  dynamical  system.  A  simple  application  of  Lemma  4  shows  that  the 
system  converges  to  the  only  fixed  point  of 


xft)  = 

7 i 

7*  + 

that  is,  p  ft)  +  qft)  = 

7* 

7*  +  <5* 

(4.14) 

Again,  by  Lemma  4,  this  convergence  is  exponentially  quick. 

□ 

Lemma  6.  IfqfT)  <  af  any  pme  t  =  T,  then  qft )  <  —^r  for  all  t  >  T. 

7i+A  7 

Proof  We  prove  this  by  induction. 

Base  Case:  qfT )  < 

Assumption:  Suppose  <&(£  —  1)  <  Ty  at  some  time  instant  f  —  1. 
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Induction:  Applying  Equation  4.3, 


Qi(t)  =  Qi(t-  1)  ((fit)  -Si) 

+  (1  -  pfit  -  1)  -  qfit  -  1))  7 i 

<  Qi(t  -  1)(1  -  Si)  +  (1  -  qfit  -  l))ji 

(since  (fit)  <  1  and  pfit  —  1)  >  0) 

<  li  +  Qi(t  -  1)  (1  -  7f  -  <*») 
li 


<7 i  + 


,c  (l-7i-<y 
7t  +  <7 

(from  the  Assumption) 


< 


7 i 


□ 


7i  + 

Thus,  g.j(f)  <  — for  all  time  t  >  T. 

li+Oi 

Lemma  7.  Under  the  condition  that  qfit)  <  one/  |A1  g|  <  1,  f/re  dynamical  system  of  Equa¬ 
tions  4.2-4. 3  tends  to  the  fixed  point  (pfit)  =  0,  q,  ( t )  =  ^  j  exponentially  quickly. 


7 i 


Proof.  We  can  use  g*(f)  <  in  Equation  4.2  to  get: 


Pi{t)  <  Pi{t  -  1)  (1  -  Si)  + 


li 


li  + 


a -cm 


(4.15) 


Now,  we  have  the  following  inequality  regarding  (fit): 


C fit)  =  (i  -  r^jipfit  - 1)) 
>  1  -  Ef^rjPjiPjit  -  1) 

N 


and  thus,  (1  -  (ft))  <  T,Jf=1rj(3jipj(t  -  1) 


(4.16) 


Using  Equation  4.16  in  4.15: 


o  <Pi(t)  <  Pi(t  -  1)  (1  -  Si)  + 


li 


li  +  Si 


>7  \rjjjd>;fit  -  1) 


Converting  to  the  matrix  form,  we  see 


0  <  p{t)  <  S p(t  —  1)  (from  Equation  4.4) 

<  S 2p(t  -  2)  <  . . . 

<  sy0 


(4.17) 


What  is  S'pqI  Suppose  we  had  equalities  instead  of  “less  than”  signs  in  the  previous  equations. 
Then,  we  would  have  a  linear  dynamical  system:  p(t)  =  S p(t  —  1)  =  S 2p(t  —  2)  =  . . .  =  Stp0. 
We  could  apply  Lemma  4  to  this  system:  |A1  g|  <  1  (from  the  statement  of  this  Lemma)  so  the 

system  would  be  stable  and  would  converge  to  the  fixed  point  p(t)  =  0  exponentially  quickly. 
Thus,  SVo  — >  0  exponentially  quickly. 
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Combining  this  with  Equation  4.17,  0  <  p(t)  <  S*po  — 5 ►  0,  and  thus,  p(t)  — >  0.  So,  pi(t)  — >  0 
exponentially  quickly,  and  from  Lemma  5,  this  implies  that  q^t)  — >  — Thus,  we  reach  the 

yj+Oj 

fixed  point  (pi(t)  =  0,  q,(t)  =  exponentially  quickly,  irrespective  of  the  starting  state.  □ 

4.6  Summary 

Under  what  conditions  does  information  survive  in  a  sensor  or  P2P  network  with  node  and  link 
failures?  As  with  the  study  of  viral  propagation,  we  could  formulate  the  problem  as  a  dynami¬ 
cal  system,  showing  the  generality  of  this  framework.  The  dynamical  system  equations  lead  to  a 
threshold  condition  below  which  a  “datum”  is  expected  to  disappear  from  the  network  exponen¬ 
tially  quickly,  but  above  which  it  may  survive  for  a  long  time.  Several  corollaries  and  special  cases 
were  used  to  demonstrate  the  intuitiveness  of  this  result;  in  particular,  we  showed  that  the  epidemic 
threshold  for  the  SIS  model  of  viral  propagation  is  a  special  case  of  our  information  threshold  re¬ 
sult.  Experiments  on  a  variety  of  synthetic  and  real-world  datasets,  both  sensor  and  P2P  networks, 
show  the  accuracy  of  our  results. 
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Chapter  5 

Automatically  grouping  correlated  nodes 
using  Cross-Associations 


“How  can  we  automatically  find  natural  node  groups  in  a  large 
graph  ?  ” 

Clustering  is  one  of  the  most  common  methods  of  data  analysis,  and  it  has  many  applications  in 
graph  mining.  For  example,  given  a  list  of  customers  and  the  products  they  buy,  we  might  want  to 
find  customer  “groups,”  and  the  product  “groups”  they  are  most  interested  in.  This  “segmenting” 
of  the  customer  base  can  bring  out  the  major  customer  “behaviors,”  and  is  useful  in  visualizing  the 
data  at  a  glance. 

Clustering  has  applicability  in  a  wide  range  of  datasets.  Apart  from  the  customer-product 
example  above,  we  can  clusters  data  on: 

•  Users  versus  their  preferences:  With  the  growing  importance  of  user-personalization  on  the 
Web,  finding  user  groups  and  preference  groups  can  be  very  useful. 

•  Bank  accounts  versus  transactions:  Individual  bank  accounts  can  be  tagged  according  to  the 
“transaction  types”  that  they  engage  in.  Deviation  from  this  behavior  can  be  used  to  trigger 
fraud  detection  systems. 

•  Documents  versus  the  words  in  them:  Here,  we  could  find  clusters  of  document  groups  (say, 
science  fiction  novels  and  thrillers),  based  on  the  word  groups  that  occur  most  frequently  in 
them.  A  user  who  prefers  one  document  can  then  be  recommended  another  document  in  that 
group. 

•  Social  networks:  A  social  network  can  describe  relationships  between  individuals:  who 
trusts  whom,  who  meets  whom,  who  has  financial  transactions  with  whom,  and  so  on. 
Grouping  people  based  on  such  information  could  be  useful  in,  say,  detection  of  money 
laundering  rings  in  financial  transaction  networks. 

Apart  from  the  examples  above,  we  can  imagine  many  other  instances  where  graph  clustering 
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would  be  useful:  in  graph  partitioning  and  community  detection  on  the  Web,  in  collaborative 
filtering  applications,  in  microarray  data  analysis  and  so  on. 

Note  that  these  examples  include  both  bipartite  graphs  (such  as  users-vs-preferences)  and  self¬ 
graphs  (such  as  social  networks).  In  fact,  any  setting  that  has  a  many-to-many  relationship  in 
database  terminology  can  be  represented  as  a  graph,  so  a  graph  clustering  algorithm  would  be 
applicable  on  a  wide  variety  of  datasets.  We  focus  only  on  unweighted  graphs,  which  can  be 
represented  as  adjacency  matrices  with  0/1  entries  (see  Chapter  2).  Hence,  from  now  on,  we  use 
the  terms  “graph”  (with  nodes)  and  “matrix”  (with  rows  and  columns)  interchangeably. 

A  good  graph  clustering  algorithm  should  have  the  following  properties: 

•  (PI)  It  should  automatically  figure  out  the  number  of  clusters. 

•  (P2)  It  should  cluster  the  rows  and  columns  simultaneously. 

•  (P3)  It  should  be  scalable. 

•  (P4)  It  should  allow  incremental  updates. 

•  (P5)  It  should  apply  to  both  self-graphs  and  bipartite  graphs.  For  bipartite  graphs,  we  have 
row  and  column  clusters,  representing  node  groups  in  the  “from”  and  “to”  parts  of  the  bipar¬ 
tite  graph.  For  self-graphs,  we  might  have  an  additional  condition  that  the  row  and  column 
clusters  be  identical  (i.e.,  only  “node”  groups,  instead  of  “from”  and  “to”  groups). 

In  this  Chapter,  we  will  describe  an  algorithm  which  has  all  of  these  properties.  In  addition, 
the  same  clustering  algorithm  can  be  easily  extended  to  mine  the  data  even  further.  Specifically, 
the  goals  of  our  algorithm  are: 

•  (Gl)  Find  clusters. 

•  (G2)  Find  outliers. 

•  (G3)  Compute  inter-cluster  distances. 

Intuitively,  we  seek  to  group  rows  and  edges  (i.e.,  nodes)  so  that  the  adjacency  matrix  is  divided 
into  rectangular/square  regions  as  “similar”  or  “homogeneous”  as  possible.  The  homogeneity 
would  imply  that  the  graph  nodes  in  that  (row  or  column)  group  are  all  “close”  to  each  other,  and  the 
density  of  each  region  would  represent  the  strength  of  connections  between  groups.  These  regions 
of  varying  density,  which  we  call  cross-associations,  would  succinctly  summarize  the  underlying 
structure  of  associations  between  nodes. 

In  short,  our  method  will  take  as  input  a  matrix  like  in  Figure  5.1(a),  and  it  will  quickly  and 
automatically  (i)  determine  a  good  number  of  row  groups  k  and  column  groups  £  and  (ii)  re-order 
the  rows  and  columns,  to  reveal  the  hidden  structure  of  the  matrix,  like  in  Figure  5.1(e).  Then,  it 
will  use  the  structure  found  in  the  previous  step  to  automatically  find  abnormal  (“outlier”)  edges, 
and  also  to  compute  the  “distances”  between  pairs  of  groups. 

We  will  first  discuss  related  work  in  Section  5.1.  Then,  in  Section  5.2,  we  formulate  our  data 
description  model,  and  a  corresponding  cost  function  for  each  possible  clustering.  Based  on  this, 
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Figure  5.1:  Snapshots  of  algorithm  execution:  Starting  with  the  matrix  in  plot  (a),  the  clustering 
is  successively  refined  and  the  number  of  clusters  increased  till  we  reach  the  correct  clustering  in 
plot  (e). 

we  describe  our  automatic,  parameter-free  algorithm  for  finding  good  clusters  in  Section  5.3.  We 
then  use  these  clusters  to  find  outliers  and  compute  inter-cluster  distances  in  Section  5.4.  Ex¬ 
perimental  results  are  provided  in  Section  5.5.  Section  5.6  gives  detailed  proofs,  followed  by  a 
summary  in  Section  5.7. 


5.1  Related  Work 

There  are  numerous  settings  where  we  want  to  find  patterns,  correlations  and  rules,  and  time-tested 
tools  exist  for  most  of  these  tasks.  However,  with  a  few  exceptions,  all  require  tuning  and  human 
intervention,  thus  failing  on  property  (PI).  We  will  discuss  several  of  these  approaches  below. 

Clustering:  Traditional  clustering  approaches  group  along  one  dimension  only:  given  a  collection 
of  n  points  in  m  dimensions,  they  find  “groupings”  of  the  n  points.  This  setting  makes  sense  in 
several  domains  (for  example,  if  the  m  dimensions  have  an  inherent  ordering),  but  it  is  different 
from  our  problem  setting. 

Also,  most  of  the  algorithms  require  a  user-given  parameter,  such  as  the  number  of  clusters  k  in 
the  popular  k- means  approach.  The  problem  of  finding  k  is  a  difficult  one  and  has  attracted  atten¬ 
tion  recently;  examples  include  X-means  [Pelleg  and  Moore,  2  ],  which  uses  the  BIC  heuristic, 

and  G-means  [  lamerly  and  Elkan,  2003],  which  assumes  a  mixture  of  Gaussians  (an  often  reason¬ 
able  assumption,  but  which  may  not  hold  for  binary  matrices). 

There  are  many  other  recent  clustering  algorithms  too,  including  CURE  [  iha  et  al.,  1998], 
BIRCH  [Zhang  et  al.,  1996],  Chameleon  [Karypis  et  al.,  1999],  [Hinneburg  and  Keim,  1998], 
and  some  /;:- means  variants  such  as  fc-harmonic  means  [  'hang  et  al.,  ]  and  spherical  k- 
means  [Dhillon  and  Modha,  2001];  see  also  [Han  and  Kamber,  2000]. 

However,  they  often  need  k  as  user  input,  or  focus  on  clustering  along  one  dimension  only, 
or  suffer  from  the  dimensionality  curse  (like  the  ones  that  require  a  co-variance  matrix);  others 
may  not  scale  up  for  large  datasets.  Also,  in  our  case,  the  points  and  their  corresponding  vectors 
are  semantically  related  (each  node  occurs  as  a  point  and  as  a  component  of  each  vector);  most 
clustering  methods  do  not  consider  this. 
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Thus,  while  these  methods  are  applicable  in  many  settings,  they  do  not  match  our  required 
properties. 

Co-clustering:  Recent  work  on  Information-theoretic  Co-clustering  [  iillon  et  ah,  ]  (called 
ITCC  from  now  on)  is  more  closely  related  to  ours;  however,  the  differences  are  significant. 

•  ITCC  focuses  on  contingency  tables,  not  binary  matrices. 

•  The  main  idea  is  based  on  lossy  compression,  whereas  we  employ  a  lossless  compression 
scheme. 

•  ITCC  seeks  matrix  approximations  that  minimize  the  KL-divergence  to  the  original  matrix, 
while  we  employ  an  MDL-based  approach  and  a  model  that  describes  the  data  exactly. 

•  Finally,  ITCC  assumes  knowledge  of  the  number  of  clusters. 

As  of  this  writing,  it  is  not  clear  how  to  apply  MDL  philosophy  to  the  frameworks  in  [Dhillon 
et  al.,  2003,  Friedman  et  al.,  2001,  Hofmann,  1999];  in  contrast,  our  formulation  is  designed  from 
grounds-up  to  be  amenable  to  a  parameter-free  formulation. 

More  remotely  related  are  matrix  decompositions  based  on  aspect  models  (one  of  the  most 
prominent  being  PLSA  [  lofmann,  ],  which  again  minimizes  KL-divergence).  These  also 
assume  that  the  number  of  “aspects”  is  given. 

Graph  partitioning:  The  prevailing  methods  are  METIS  [  Carypis  and  Kumar,  19981  ],  and  spectral 
partitioning  [Andrew  Y.N  ,  ].  Both  approaches  have  attracted  a  lot  of  interest  and  attention; 

however,  both  need  the  user  to  specify  k,  that  is,  how  many  pieces  should  the  graph  be  broken  into. 
Some  methods  have  recently  been  suggested  to  find  k  [Ben-Hur  and  Guyon,  2003,  Tibshirani  et  al., 
2001  ],  but  they  need  to  sample  from  the  data  many  times  and  then  process/cluster  those  samples, 
which  can  become  very  slow  and  infeasible  for  large  datasets.  In  addition,  these  graph  partitioning 
techniques  typically  require  a  measure  of  imbalance  between  the  two  pieces  of  each  split. 

Several  other  partitioning  techniques  have  also  been  suggested.  The  Markov  Clustering  [van 
Dongen,  2000]  method  uses  random  walks,  but  is  slow.  Girvan  and  Newman  [Girvan  and  Newman, 
]  iteratively  remove  edges  with  the  highest  “stress”  to  eventually  find  disjoint  communities, 
but  the  algorithm  is  again  slow;  Clauset  et  al.  [  lauset  et  a  ,  10(  ]  suggest  a  faster  algorithm  but 
the  number  of  clusters  must  still  be  specified  by  the  user.  Flake  et  al.  [  lake  et  al.,  2000]  use  the 
max-flow  min-cut  formulation  to  find  communities  around  a  seed  node;  however,  the  selection  of 
seed  nodes  is  not  fully  automatic. 

Market-basket  analysis  / frequent  itemsets:  These  have  been  very  important  data  mining  areas,  and 
have  attracted  a  lot  of  research  [Agrawal  and  Srikant,  1994,  Han  et  al.,  2004,  Han  and  Kamber, 

].  However,  the  user  needs  to  specify  the  “support”  parameter.  The  related  work  on  “interest¬ 
ingness”  [Tuzhilin  and  Adomavicius,  2002]  still  does  not  answer  the  question  of  “support.” 

Information  retrieval  and  LSI:  The  pioneering  method  of  LSI  [Deerwester  et  al.,  ]  used  SVD 
on  the  term-document  matrix.  Again,  the  number  k  of  eigenvectors/concepts  to  keep  is  up  to 
the  user  ([Deerwester  et  al.,  1 99  ]  empirically  suggest  about  200  concepts).  Additional  matrix 
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decompositions  include  the  Semi-Discrete  Decomposition,  (SDD)  [  <Colda  and  O’Leary,  1998], 
PLSA  [  lofmann,  1999],  the  clever  use  of  random  projections  to  accelerate  SVD  [Papadimitriou 
et  al.,  1998],  and  many  more.  However,  they  all  fail  on  property  (PI). 

Other  domains:  Also  related  to  graphs  in  several  settings  is  the  work  on  conjunctive  cluster¬ 
ing  [  lishra  et  al ,  ]  requires  density  (i.e.,  “homogeneity”)  and  overlap  parameters,  as  well 

as  community  detection  [  'eddy  and  Kitsuregawa,  2(  ],  among  many  others.  Finally,  there  are 

several  approaches  to  cluster  micro-array  data  (e.g.,  [  ang  and  Zhang,  200  ]). 

In  conclusion,  the  above  methods  miss  one  or  more  of  our  prerequisites,  typically  (PI).  Next, 
we  present  our  solution. 


5.2  Data  Description  Model 

First,  we  will  define  some  terminology  (see  Table  5.1).  Let  the  graph  to  be  clustered  be  G,  with 
A  being  its  adjacency  matrix  of  size  M  x  N  (M  =  N  if  G  is  a  self-graph).  Let  us  index  the  rows 
as  1,  2 ,M  and  columns  as  1,  2, . . . ,  N. 

A  cross-association  (<3>,  \P)  of  k  row  clusters  and  t  column  clusters  is  defined  by: 

$:{1,2,...,M}  -  {1,2,...,*}  (5.1) 

*:{1,2,...,A}  -  {1,2,. 

where  $  and  \P  are  the  mappings  of  rows  to  row  clusters  and  columns  to  column  clusters.  For  a 
self-graph,  where  rows  and  columns  represent  identical  nodes,  we  might  want  a  single  set  of  node 
clusters  instead  of  separate  row  and  column  clusters  (we  will  call  this  the  Identical  case);  then, 
$  =  'P  and  k  —  £. 

Given  ( $ .  T ) ,  we  can  re-order  A  so  that  rows/columns  from  the  same  cluster  are  grouped 
together,  creating  k  ■  £  rectangular/square  blocks  (called  cross-associates),  which  we  denote  by 
Ajj,  with  %  —  1 . . .  M  and  j  =  1 ...  N.  Each  block  AtJ  has  size  n(Aij)  =  a*  •  bj,  and  contains 
n0(Ajj)  “zeros”  and  H|  (A)?)  “ones.”  From  this,  we  can  calculate  the  density  1\  (A,:  j)  of  “ones” 
in  the  block.  High(low)  densities  imply  that  certain  groups  have  stronger(weaker)  connections 
with  other  groups. 

Given  an  initial  matrix  A  (as  in  Figure  5.1(a)),  we  want  to  automatically  find  clusters  (as  in 
Figure  5.1(e)).  How  can  we  estimate  the  “correct”  number  of  row  and  column  clusters,  and  the 
memberships  of  these  clusters? 

5.2.1  Main  Idea 

To  compress  the  matrix,  we  would  prefer  to  have  only  a  few  blocks,  each  of  them  being  very 
homogeneous.  However,  having  more  clusters  lets  us  create  more  homogeneous  blocks  (at  the 
extreme,  having  1  cluster  per  node  gives  M  ■  N  perfectly  homogeneous  blocks  of  size  1  x  1).  Thus, 
the  best  compression  scheme  must  achieve  a  tradeoff  between  these  two  factors,  and  this  tradeoff 
point  indicates  the  best  values  for  k  and  l. 
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Symbol 

Description 

G 

Unweighted  graph 

A 

Binary  adjacency  matrix  corresponding  to  G 
(square  for  self-graphs,  possibly  rectangular  for 
bipartite  graphs) 

M,N 

Number  of  rows  and  columns  in  A 
(M  =  N  for  a  self-graph) 

k,e 

Number  of  row  and  column  groups 

E 

Number  of  edges  in  G 

k*,£* 

Optimal  number  of  groups 

(cfgvp) 

Cross-association 

A-ij 

Cross-associate  (submatrix) 

bj 

Dimensions  of  A,  j 

,j  ) 

Number  of  elements  n(A,  j)  dibj 

n0(A?:j),ni(Ajj) 

Number  of  0,  1  elements  in  AhJ 

(A) 

Number  of  1  elements  in  A;  ni(A)  =  E 

Pq  (■ Aj  j ) ,  Pi  ( A  ij ) 

Densities  of  0,  1  in  At  j 

Pi,j(t) 

Pi,j(t)  =  P\  (Ajj,  t  )  is  the  density  at  time  t 

H(p) 

Binary  Shannon  entropy  function 

C(  Ay) 

Code  cost  for  A* j 

T(A;M,S,tt) 

Total  cost  for  A 

Table  5.1:  Table  of  symbols 

We  accomplish  this  by  a  novel  application  of  the  overall  MDL  philosophy,  where  the  com¬ 
pression  costs  are  based  on  the  number  of  bits  required  to  transmit  both  the  “summary”  of  the 
row/column  groups,  as  well  as  each  block  given  the  groups.  Thus,  the  user  does  not  need  to  set 
any  parameters;  our  algorithm  chooses  them  so  as  to  minimize  these  costs.  We  discuss  the  exact 
cost  function  below. 

5.2.2  The  Cost  Function 

In  conformance  with  the  MDL  philosophy,  we  design  a  lossless  code  for  the  data,  and  use  the 
number  of  encoding  bits  as  the  cost  function.  This  cost  has  two  parts:  (1)  the  description  cost 
of  describing  the  cross-association  ($,  T)  and  the  blocks  formed  by  it,  and  (2)  the  code  cost  of 
describing  the  entire  matrix  given  the  cross-association. 

Description  cost:  When  we  have  separate  row  and  column  clusters,  the  description  of  the  cross¬ 
association  consists  of  the  following  parts: 

1.  Send  the  matrix  dimensions  m  and  n  using  log*  (rn)  +  log*(n),  where  log*  is  the  universal 
code  for  integers  [  ’  issanen,  ]  defined  by 

log* (a)  =  log20)  +  log2  log2(x)  +  . . .  (5.2) 
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where  only  the  positive  terms  are  retained.  This  term  is  independent  of  the  cross-association, 
and,  hence,  while  useful  for  actual  transmission  of  the  data,  will  not  figure  in  our  cost  func¬ 
tion. 


2.  Send  the  row  and  column  permutations  using  M  [log  M]  and  Ar[log  Ar]  bits,  respectively. 
Again,  this  term  is  also  independent  of  cross-association. 

3.  Send  the  number  of  groups:  (k,  l)  in  log*  k  +  log*  t  bits. 


4.  Send  the  number  of  rows  in  each  row  group  and  also  number  of  columns  in  each  column 
group.  Let  us  suppose  that  a\  >  a2  >  . . .  >  ak  >  1  and  b\  >  b2  >  . . .  >  be  >  1.  Compute 


di 


El  at  J  —  k  +  i,  i  —  1, 


.  t=i 
l 


)  -l  +  j,  j  =  1, 

>t=j 


k-l 

£-1 


Now,  the  desired  quantities  can  be  sent  using  the  following  number  of  bits: 


fc-i  t-i 

En°g^i  +  5ZriosM 

i=  1  3= 1 

5.  For  each  cross-associate  At:J,  i  =  1, . . . ,  k  and  j  =  1, . . . ,  £,  send  the  number  of  “ones” 
ni(Ajj)  in  it  using  [log(aj&j  +  1)]  bits. 

Summing  all  of  these: 


Description  Cost  =  log*  k  +  log*  £ 

k-l  l-l 

+En°ga«i +Erio§M 

i=l  j=  1 

k  l 

+EEriog(°A-  +  x)i 

i=  1  j=l 

In  the  Identical  case,  we  must  transmit  only  node  clusters  instead  of  row  and  column  clus¬ 
ters,  and  the  equation  is  modified  to  reflect  this: 

Description  Cost  when  (<3>  =  \F)  =  log*  k 

k-l 

+En°g 

i= 1 
k  k 

+EEwa*aJ  +  x)i 

i=  1  3=  1 
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Code  cost:  Suppose  that  the  entire  preamble  specified  above  (containing  information  about  the 
square  and  rectangular  blocks)  has  been  sent.  We  now  transmit  the  actual  matrix  given  this  infor¬ 
mation.  The  number  of  bits  C(Aijj)  to  encode  each  block  depends  only  on  its  density  I\  (  A;J): 

OA,()  =  n(Aid)-H(P1(AiJ)) 

=  m  (A«)  •  lo«  +  "°  <AiJ) '  i°g  (i "  (53) 

Summing  over  all  blocks, 

k  l 

Code  cost  =  EEc<A..i)  <5-4) 

2=1  3= 1 

For  the  IDENTICAL  case,  the  equation  is  modified  slightly: 

k  k 

Code  cost  when  (<3>  =  \l/)  =  EECA)  (5-5) 

2=1  3= 1 

Final  cost  function:  Summing  the  description  cost  and  the  code  cost  gives  us  the  total  cost  of 
encoding  the  matrix  A  using  the  cross-association  (<f>,  T): 

Total  cost  T(A;  k,  £,  <F,  \F)  =  log*  A;  +  log*  f 

k- 1  t- 1 

+En°g«2i +Erio§M 

2=1  j  =  1 

k  i 

+ 1)1 

2=1  3=  1 
k  t 

+  EEClA«)  (5.6) 

2=1  i=i 

For  the  Identical  case: 

Total  cost  when  ($  =  \1>)  T( A;  k,  <f>)  =  log*  k 

k- 1 

+En°s^ 

^=1 
k  k 

+EE^log(ai(ii + 

2=1  j  =  1 
k  k 

+  EEW  (5.7) 

2=1  j=l 
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Iteration  1  (rows)  Iteration  2  (cols) 


100  200  300  400  500  600  100  200  300  400  500  600 

Column  Clusters  Column  Clusters 


Iteration  4  (cols) 


100  200  300  400  500  600 

Column  Clusters 


(a)  Original  groups  (b)  Step  2 


(c)  Step  4 


(d)  Step  4 


Figure  5.2:  Iterations  of  the  SHUFFLE  algorithm:  Holding  k  and  t  fixed  (k  —  £  —  3),  we  re¬ 
peatedly  apply  Steps  2  and  4  of  the  Shuffle  algorithm  until  no  improvements  are  possible  (Step 
6).  Iteration  3  (Step  2)  is  omitted,  since  it  performs  no  swapping.  To  potentially  decrease  the  cost 
further,  we  must  increase  k  or  £  or  both,  which  is  what  the  Split  algorithm  does. 


5.3  (Gl)  Graph  Clustering  Algorithm 

The  equations  mentioned  above  assign  a  cost  to  each  possible  clustering;  given  two  distinct  clus¬ 
terings,  we  can  use  the  cost  function  to  choose  between  the  two.  How  can  we  use  this  to  find  a 
good  clustering?  In  other  words,  how  do  we  pick  the  best  numbers  of  clusters  k*  and  £*,  and  the 
memberships  of  these  clusters? 

We  propose  an  iterative  two-step  scheme  to  answer  this  question: 

1.  Shuffle  (inner  loop):  For  a  given  k  and  i,  find  a  good  arrangement  (i.e.,  cross-association). 

2.  Split  (outer  loop):  Efficiently  search  for  the  best  k  and  i  (k,  £  —  1,2,.. .). 


We  present  each  in  the  following  sections. 


5.3.1  Shuffle  (Inner  Loop) 

Shuffle  (Figure  5.3)  is  a  simple  and  efficient  alternating  minimization  algorithm  that  yields  a 
local  minimum  for  the  code  cost  (Equation  5.4,  or  5.5  for  the  Identical  case).  In  the  regions 
where  Shuffle  is  performed,  the  code  cost  typically  dominates  the  description  cost;  thus,  lower¬ 
ing  the  code  cost  also  lowers  the  total  cost.  Figure  5.2  shows  snapshots  of  SHUFFLE  iterations  on 
an  example  dataset. 

Theorem  5.  After  each  iteration  of  SHUFFLE,  the  code  cost  either  decreases  or  remains  the  same: 

Si=i  Yfj= i  >  £t=i  Y?j=i c(Ai,j(t  + 1))  >  i  X^=i +  2)). 

Proof  The  proof  is  provided  in  Section  5.6.  □ 

The  algorithm  is  analogous  for  the  IDENTICAL  case:  instead  of  considering  row  splices  x:l  and 
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column  splices  yl  separately,  we  must  consider  them  together: 


k  r 


$t+i(x)  =  argminV' 

i <i<k  j=1 

+  ni(yJ)  log 


77-1  (xj)  log  — — +  no  (a;-7  )  log  ( 1  — 


Pi,j  (t) 


+  no(yJ)  log  ( 1  — 


+  1) 


+Aa 


+  logP$t(x),^)  -  1o§PmW 


+(1  -  AXjX)  log  (l  -  ^  (*))  +  log  (l  -  ~  log  (1  -  pvW) 


where  Ax )X  is  one  if  the  node  has  a  link  to  itself  (a  self-loop),  and  zero  otherwise.  Thus,  the  first 
two  terms  are  the  same  as  in  Figure  5.3,  and  the  last  two  terms  take  care  of  “double-counting”  of 
the  self-loop. 


5.3.2  Split  (Outer  Loop) 

Every  time  Shuffle  converges  to  a  local  minimum  of  the  code  cost,  we  can  search  for  better 
values  of  k  and  i.  This  search  should  attempt  to  lower  the  total  cost  (Equation  5.6  or  5.7),  and  we 
can  employ  any  integer  or  even  continuous  optimization  algorithm  (gradient  descent,  Nelder-Mead, 
simulated  annealing).  We  experimented  with  several  alternatives,  and  obtained  the  best  results  with 
the  Split  algorithm,  shown  in  Figure  5.4.  The  Identical  case  is  analogous.  Figure  5.1  gives 
example  snapshots  from  the  execution  of  the  full  algorithm,  with  each  plot  showing  the  results 
after  one  iteration  of  Split,  followed  by  Shuffle  iterations. 

Theorem  6.  On  each  iteration  of  Split,  the  code  cost  either  decreases  or  remains  the  same:  If 
A  =  [AxA2],  then  C( Ax)  +  67(A2)  <  67(A). 

Proof  The  proof  is  provided  in  Section  5.6.  □ 


5.3.3  Matching  the  desired  properties 

To  summarize,  our  algorithm  finds  clusters  in  a  graph  by  repeatedly  searching  for  better  values  for 
k  and  6  (the  numbers  of  groups),  and  then  rearranging  the  adjacency  matrix  to  decrease  the  cost, 
while  keeping  the  numbers  of  groups  fixed.  We  have  thus  matched  goal  (Gl),  while  matching  all 
of  our  desired  properties: 

•  (PI)  By  Theorems  5  and  6,  each  of  these  steps  maintains  or  decreases  the  code  cost.  How¬ 
ever,  the  description  complexity  evidently  increases  with  k  and  i.  The  total  cost  metric 
(Equation  5.6  or  5.7  depending  on  the  problem  setting)  biases  the  algorithm  towards  fewer 
clusters,  and  provides  an  automatic  stopping  criterion.  Thus,  the  algorithm  requires  no  hu¬ 
man  intervention,  and  is  completely  automatic. 

•  (P2)  By  construction,  the  algorithm  treats  rows  and  columns  equally  and  simultaneously. 
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Algorithm  Shuffle  (Finding  better  (<£>,  \P)  given  k,£): 

1.  Let  t  denote  the  iteration  index.  Initially,  set  t  =  0.  If  no  ((I>o.  'I'd)  is  provided,  start  with  an 

arbitrary  ($0,  'To)  mapping  nodes  into  k  node  groups.  For  this  initial  partition,  compute  the 
submatrices  and  the  corresponding  distributions  P\  (A.;j.  t)  =  PhJ{t). 

2.  We  will  now  hold  column  assignments,  namely,  T  , ,  fixed.  For  every  row  x,  splice  it  into  £ 
parts  each  corresponding  to  one  of  the  column  groups.  Denote  them  as  x1, . . . ,  xl .  For  each 
of  these  parts,  compute  n0(xj)  and  ri\  (x3),  where  j  =  1, . . . ,  i.  Now,  shuffle  row  x  to  row 
group  <Ft+i(a;)  such  that: 


$ 


t+iW  =  argmin  > 

1 <i<k  j=l  L 


ri\  (x3  )  log 


+  n0(xj)  log  (  1 


(5.8) 


3.  With  respect  to  cross-association  \Ft),  recompute  the  matrices  A;J(/;  +  1),  and  corre¬ 
sponding  distributions  P\  (AltJ.  t  +  1 ) )  =  Pij(t  +  1). 

4.  For  this  step,  we  will  hold  row  assignments,  namely,  $4+1,  fixed.  For  every  column  y,  splice 
it  into  k  parts  each  corresponding  to  one  of  the  column  groups.  Denote  them  as  y1, ...  ,yk. 
For  each  of  these  parts,  compute  n0(yl)  and  n\(y%),  where  i  =  1, . . . ,  k.  Now,  assign  y  to 
column  group  \Pt+2(y)  such  that: 


k  p 

Vt+2{y)  =  arg  min  V  m (yr)  log 
i=i  L 


Pi,j(t  +  1) 


+  n0 (2/*)  log  1  - 


Pij(t  +  1) 


(5.9) 


5.  For  the  new  cross-association  ($4+1,  2),  recompute  the  matrices  A ^(t  +  2),  and  corre¬ 

sponding  distributions  P\  (Aj^,  t  +  2))  =  Pij(t  +  2). 

6.  If  there  is  no  decrease  in  total  cost,  stop;  otherwise,  set  t  —  t  +  2,  go  to  step  2,  and  iterate. 


Figure  5.3:  Algorithm  SHUFFLE  (inner  loop) 


•  (P3)  The  overall  complexity  is  O  (E  ■  (k*  +  £*)'2),  if  we  ignore  the  number  of  Shuffle 
iterations  /  (in  practice,  /  <  20  is  always  sufficient).  Thus,  it  is  linear  in  the  number  of 
edges,  and  hence  scalable. 

•  (P4)  When  new  nodes  are  obtained  (such  as  from  new  crawls  for  a  Web  graph),  we  can  put 
them  into  the  clusters  which  minimize  the  increase  in  total  encoding  cost.  Similarly,  when 
new  edges  are  found,  the  corresponding  nodes  can  be  reassigned  to  new  clusters.  The  algo¬ 
rithm  can  then  be  run  again  with  this  initialization  till  it  converges.  Similar  methods  apply 
for  node/edge  deletions.  Thus,  new  additions  or  deletions  can  be  handled  incrementally. 

•  (P5)  The  algorithm  is  applicable  to  both  self-graphs  and  bipartite  graphs.  When  we  require 
a  single  set  of  node  clusters  instead  of  separate  row  and  column  clusters,  the  extra  constraint 
can  be  handled  easily,  as  we  demonstrated  for  the  Identical  case. 
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Algorithm  Split  (Choosing  better  k  and  £): 


1.  Let  T  denote  the  search  iteration  index.  Start  with  T  =  0  and  k°  =  £°  =  1. 

2.  At  iteration  T,  increase  the  number  of  row  groups:  k(T  + 1)  =  k(T)  + 1.  Split  the  row  group 
r  with  maximum  entropy,  i.e., 

r  ■■=  arg  max  ^  m  (AhJ)  ■  H  {Phj) 

l<i<k 


Construct  an  initial  label  map  (I>j.  i  as  follows:  For  every  row  x  that  belongs  to  row  group  r 
(i.e.,  $t(t)  =  r),  place  it  into  the  new  group  k(T  +  1)  (i.e.,  set  $T+1(a;)  =  k(T  +  1))  if  and 
only  if  it  decreases  the  per-row  entropy  of  the  group  r,  i.e.,  if  and  only  if 


E 

i  <j<e 


(yy  ■  h  (ka 

ar  —  1 


< 


E 

i<j<e 


ni  (Ajj)  ■  H  ( Pr,j ) 
ar 


where  A'rj  is  Ar  j  without  row  x.  Otherwise  let  <l>T+1(x)  =  r  =  If  we  move  the  row 

to  the  new  group,  we  also  update  Dr  ]  (for  all  1  <  j  <  l). 

3.  Run  Shuffle  (the  inner  loop)  with  initial  configuration  (<3>t+i,  ^t)  to  find  a  new  cross¬ 
association  and  the  corresponding  total  cost. 

4.  If  there  is  no  decrease  in  total  cost,  stop  and  return  (k*,  t )  =  {k'\  iT) — with  corresponding 
cross-associations  (f&T,  ^t)-  Otherwise,  set  T  =  T  +  1  and  continue. 

5-7.  Similar  to  steps  2-4,  but  with  columns  instead. 


Figure  5.4:  Algorithm  Split  (outer  loop) 


5.3.4  Relationship  with  hierarchical  clustering 

Our  algorithm  finds  one  flat  set  of  node  groups.  Even  though  each  new  group  is  formed  by  a  split 
of  a  pre-existing  group,  the  SHUFFLE  iterations  mix  up  the  group  memberships,  and  any  hierarchy 
is  destroyed.  However,  we  could  obtain  a  hierarchical  clustering  by  (1)  running  the  full  algorithm 
to  obtain  a  flat  clustering,  (2)  separately  applying  the  entire  algorithm  on  each  of  these  clusters, 
and  (3)  repeating.  This  is  an  avenue  for  future  work. 


5.4  Finding  outlier  edges  and  inter-cluster  distances 

Having  found  the  underlying  structure  of  a  graph  in  the  form  of  node  groups  (goal  (Gl)),  we 
can  utilize  this  information  to  further  mine  the  data.  Specifically,  we  want  to  detect  outlier  edges 
(goal  (G2))  and  compute  inter-group  “distances”  (goal  (G3)).  Again,  we  use  our  information- 
theoretic  approach  to  solve  all  these  problems  efficiently. 
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5.4.1  (G2)  Outlier  edges 


Which  edges  between  nodes  are  abnormal/suspicious?  Intuitively,  an  outlier  shows  some  deviation 
from  normality,  and  so  it  should  hurt  attempts  to  compress  data.  Thus,  an  edge  whose  removal 
significantly  reduces  the  total  encoding  cost  is  an  outlier.  Our  algorithm  is:  find  the  block  where 
removal  of  an  edge  leads  to  the  maximum  immediate  reduction  in  cost  (that  is,  no  iterations  of 
Shuffle  and  Split  are  performed).  All  edges  within  that  block  contribute  equally  to  the  cost, 
and  so  all  of  them  are  considered  outliers. 


“Outlierness”  of  edge  (u,  v)  :=  T(A/;  k,  f,  <T,  'k)  —  T(A;  k,  £,  <T,  'k)  (5.10) 


where  A'  is  A  without  the  edge  (u,v).  This  can  be  used  to  rank  the  edges  in  terms  of  their 
“outlierness”. 

5.4.2  (G3)  Computing  inter-group  “distances” 

How  “close”  are  two  node  groups  to  each  other?  Following  our  information  theory  footing,  we 
propose  the  following  criterion:  If  two  groups  are  “close”,  then  combining  the  two  into  one  group 
should  not  lead  to  a  big  increase  in  encoding  cost.  Based  on  this,  we  define  “distance”  between 
two  groups  as  the  relative  increase  in  encoding  cost  if  the  two  were  merged  into  one: 


Dist(i,j) 


Cost  (merged)  —  Cost(i )  —  Cost(j) 
Cost(i)  +  Cost(j) 


(5.11) 


where  only  the  nodes  in  groups  i  and  j  are  used  in  computing  costs.  We  experimented  with  other 
measures  (such  as  the  absolute  increase  in  cost)  but  Equation  5.11  gave  the  best  results. 

To  computing  outliers  and  distances  between  groups,  only  the  statistics  of  the  final  clustering 
need  to  be  used  (i.e.,  the  block  sizes  n^j  and  their  densities  /\;).  Thus,  both  can  be  performed 
efficiently  for  large  graphs. 


5.5  Experiments 

We  did  experiments  to  answer  two  key  questions: 

(QD  How  good  is  the  quality  of  the  clusters? 

(Q2)  How  well  do  our  algorithms  find  outlier  edges? 

(Q3)  How  well  do  our  measures  of  inter-cluster  “distances”  work? 
(Q4)  Is  our  method  scalable? 
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We  carried  out  experiments  on  a  variety  of  datasets,  both  synthetic  and  real-world  (Table  5.2).  The 
synthetic  datasets  were: 

•  CAVE:  This  represents  a  social  network  of  “cavemen”  [  fatts,  1999],  that  is,  a  block-diagonal 
matrix  of  variable-size  blocks  (or  “caves”). 

•  CUSTPROD:  This  represents  groups  of  customers  and  their  buying  preferences1 

•  CAVE-NOISY:  A  CAVE  dataset  with  “salt-and-pepper”  noise  (10%  of  the  number  of  edges). 

•  NOISE:  This  contains  pure  white  noise. 

All  of  these  datasets  were  scrambled  before  being  fed  into  the  cross-associations  program  as  input. 
The  real-world  datasets  were: 

•  CLASSIC:  A  bipartite  graph  of  Usenet  documents  from  Cornell’s  SMART  collection,  and  the 

words  present  in  them  (see  [  )hillon  et  al ,  ]).  The  documents  belong  to  three  distinct 

groups:  MEDLINE  (medicine),  CISI  (information  retrieval),  and  CRANFIELD  (aerody¬ 
namics). 

•  GRANTS:  A  set  of  NSF  proposal  documents  from  several  disciplines  (physics,  bio-informatics, 
etc.),  versus  the  words  in  their  abstracts. 

•  EPINIONS:  A  “who-trusts-whom”  social  graph  of  www .  epinions  .  com  users  [Domingos 
and  Richardson,  2001]. 

•  CLICKSTREAM:  A  graph  of  users  and  the  URLs  they  clicked  on  [Montgomery  and  Falout- 

sos,  2001]. 

•  OREGON:  A  graph  of  network  connections  between  Autonomous  Systems  (AS),  obtained 
from 

http : / / topology . eecs . umich . edu/ data . html. 

•  DBLP:  A  co-citation  and  co-authorship  graph  extracted  from 
www.informatik.uni-trier.de/~ley/db.  The  nodes  are  authors  in  the  SIG- 
MOD,  ICDE,  VLDB,  PODS  or  ICDT  (database  conferences);  two  nodes  are  linked  by  an 
edge  if  the  two  authors  have  co-authored  a  paper  or  one  has  cited  a  paper  by  the  other  (thus, 
this  graph  is  undirected). 

Our  implementation  was  done  in  MATLAB  (version  6.5  on  Linux)  using  sparse  matrices.  The 
experiments  were  performed  on  an  Intel  Xeon  2.8GHz  machine  with  1GB  RAM. 

5.5.1  (Ql)  Quality  of  clustering 

Synthetic  Datasets:  Figure  5.5  shows  the  resulting  cross-associations  on  the  synthetic  datasets. 
We  can  make  the  following  observations: 

'We  try  to  capture  market  segments  with  heavily  overlapping  product  preferences,  like,  say,  “single  persons,” 
buying  beer  and  chips,  “couples,”  buying  the  above,  plus  frozen  dinners,  “families,”  buying  all  the  above  plus  milk, 
and  so  on. 
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Dataset 

Dimensions 

Edges 

(directed) 

Graph  type 

CAVE 

810x900 

162,000 

Bipartite 

CAVE-NOISY 

810x900 

171,741 

Bipartite 

CUSTPROD 

295x30 

5,820 

Bipartite 

NOISE 

100x100 

952 

Self 

CLASSIC 

3,893x4,303 

176,347 

Bipartite 

GRANTS 

13,297x5,298 

805,063 

Bipartite 

EPINIONS 

75,888x75,888 

508,960 

Self 

CLICKSTREAM 

23,396x199,308 

952,580 

Bipartite 

OREGON 

11,461x11,461 

65,460 

Self 

DBLP 

6,090x6,090 

175,494 

Self 

Table  5.2:  Dataset  characteristics. 
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Figure  5.5:  Cross-associations  on  synthetic  datasets :  Our  method  gives  the  intuitively  correct 
cross-associations  for  (a)  CAVE  and  (b)  CUSTPROD.  Some  extra  groups  are  found  for  (c)  CAVE- 
NOISY,  which  are  explained  by  the  patterns  emerging  due  to  randomness,  such  as  the  “almost- 
empty”  and  “more-dense”  cross-associations  for  NOISE  (d). 

•  Exact  results  on  noise-free  datasets:  For  CAVE  and  CUSTPROD,  we  get  exactly  the  intu¬ 
itively  correct  groups.  This  serves  as  a  sanity  check. 

•  Robust  performance  on  noisy  datasets:  For  CAVE-NOISY,  we  find  some  extra  groups  which, 
on  closer  examination,  are  picking  up  patterns  in  the  noise.  This  is  expected:  it  is  well 
known  that  spurious  patterns  emerge,  even  when  we  have  pure  noise.  Figure  5.5(d)  confirms 
it:  even  in  the  NOISE  matrix,  our  algorithm  finds  blocks  of  clearly  lower  or  higher  density. 
However,  even  with  such  significant  noise,  very  few  “spurious”  groups  are  found. 

CLASSIC/  Figure  5.6(a)  shows  the  clusters  found  in  the  CLASSIC  dataset. 

•  The  results  match  the  known  groupings:  We  see  that  the  cross-associates  are  in  agreement 
with  the  known  document  classes  (left  axis  annotations).  We  also  annotated  some  of  the 
column  groups  with  their  most  frequent  words.  Cross-associates  belonging  to  the  same  doc¬ 
ument  (row)  group  clearly  follow  similar  patterns  with  respect  to  the  word  (column)  groups. 
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manifolds,  operators,  undergraduate,  education, 

harmonic,  operator,  topological  natipnal,  projects 


insipidus,  alveolar,  aortic,  blood,  disease,  clinical,  shape,  nasa,  leading, 
death,  prognosis,  intravenous  ceu,  dssue,  patient  assumed,  thin 


CISi 


CRANFIELD 


providing,  studying,  records,  \  paint,  examination,  fall, 

developments,  students,  abstract,  notation,  works  raise,  leave,  based 
rules,  community  construct,  bibliographies 

(a)  CLASSIC  ( k *  =  15,  t  =  19) 


encoding,  characters,  | 

bind,  nucleus,  coupling,  deposition, 

recombination  plasma,  separation,  beam 


meetings,  organizations, 
session,  participating 


(b)  GRANTS  (, k *  =  41,  t  =  28) 


Figure  5.6:  Cross-associations  for  CLASSIC  and  GRANTS:  Due  to  the  dataset  sizes,  we  show  the 
Cross-associations  via  shading;  darker  shades  correspond  denser  blocks  (more  ones).  We  also 
show  the  most  frequently  occurring  words  for  several  of  the  word  (column)  groups;  these  clearly 
belong  to  different  categories  of  words. 


For  example,  the  MEDLINE  row  groups  are  most  strongly  related  to  the  first  and  second 
column  groups,  both  of  which  are  related  to  medicine,  (“insipidus,”  “alveolar,”  “prognosis” 
in  the  first  column  group;  “blood,”  “disease,”  “cell,”  etc,  in  the  second). 

•  Previously  unknown  structure  is  revealed:  Besides  being  in  agreement  with  the  known  doc¬ 
ument  classes,  the  cross-associates  reveal  further  structure.  For  example,  the  first  word 
group  consists  of  more  “technical”  medical  terms,  while  second  group  consists  of  “every¬ 
day”  terms,  or  terms  that  are  used  in  medicine  often,  but  not  exclusively* 2.  Thus,  the  second 
word  group  is  more  likely  to  show  up  in  other  document  groups  (and  indeed  it  does,  although 
not  immediately  apparent  in  the  figure),  which  is  why  our  algorithm  separates  the  two. 


GRANTS:  Again,  the  results  mirror  those  for  the  CLASSIC  dataset: 

•  Meaningful  clusters  are  extracted:  Figure  5.6(b)  shows  the  most  common  terms  in  several 
of  the  column  clusters.  They  show  that  the  groups  found  make  intuitive  sense:  we  detect 
clusters  related  to  biology  (“encoding,”  “recombination,”  etc),  to  physics  (“coupling,”  “de¬ 
position”,  “plasma,”  etc),  to  material  sciences,  and  to  several  other  well-known  topics. 


EPINIONS,  CLICKSTREAM,  OREGON  and  DBLP:  Figure  5.7  shows  our  results  on  all  the  other 

datasets.  EPINIONS  and  DBLP  are  clustered  under  the  IDENTICAL  setting,  so  that  the  row  and 
column  clusters  are  the  same.  The  results  make  intuitive  sense: 

2This  observation  is  also  true  for  nearly  all  of  the  (approximately)  600  and  100  words  belonging  to  each  group,  not 
only  the  most  frequent  ones  shown  here. 
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3  Node  Gr6ups  6 


(a)  EPINIONS 
(k*  =  19) 


(b)  CLICKSTREAM 

(k*  =  15,  t  =  13) 


(c)  OREGON 


(k*  =  9,C  =  8) 


NWe  Gftltips  ! 


(d)  DBLP 

(fc*  =  8) 


Figure  5.7:  Cross-associations  for  other  real-world  datasets:  CLICKSTREAM  and  OREGON  are 
clustered  normally,  while  EPINIONS  and  DBLP  were  clustered  under  the  IDENTICAL  setting  (i.e., 
identical  row  and  column  clusters).  All  the  graphs  get  separated  into  two  kinds  of  groups:  large 
but  very  sparse,  and  small  but  very  dense.  The  small  dense  clusters  are  often  worthy  of  further 
analysis;  for  example,  most  well-known  database  researchers  show  up  in  the  dense  regions  of  plot 
(d). 


•  For  example,  for  the  DBLP  dataset,  the  smallest  group  consists  only  of  Michael  Stonebraker, 
David  DeWitt  and  Michael  Carey;  these  are  well-known  people  who  have  a  lot  of  papers  and 
citations.  The  other  groups  show  decreasing  number  of  connections  but  increasing  sizes. 

•  Similarly,  for  the  EPINIONS  graph,  we  find  a  small  dense  “core”  group  which  has  very  high 
connectivity,  and  then  larger  and  less  heavily-connected  groupings. 


5.5.2  (Q2)  Outlier  edges 

To  test  our  algorithm  for  picking  outliers,  we  use  a  synthetic  dataset  as  in  Figure  5.8(a).  The 
node  groups  found  are  shown  in  5.8(b).  Our  algorithm  tags  all  edges  whose  removal  would  best 
compress  the  graph  as  outliers.  Thus,  all  edges  “across”  the  two  groups  are  chosen  as  outliers 
under  this  principle  (since  all  edges  in  a  block  contribute  equally  to  the  encoding  cost),  as  shown 
in  Figure  5.8(b).  Thus,  the  intuitively  correct  outliers  are  found. 

5.5.3  (Q3)  Inter-cluster  “distances” 

To  test  for  node-group  distances,  we  use  the  graph  in  5.8(c)  with  5.8(d)  showing  the  structure 
found.  The  three  caves  have  equal  sizes  but  the  number  of  “bridge”  edges  between  groups  varies. 
This  is  correctly  picked  up  by  our  algorithm,  which  ranks  groups  with  more  “bridges”  as  being 
closer  to  each  other.  Thus,  groups  2  and  3  are  tagged  as  the  “closest”  groups,  while  groups  1  and  2 
are  “farthest”. 

Figure  5.8(e)  shows  the  inter-group  distances  between  the  clusters  found  for  DBLP.  The  dis¬ 
tances  were  computed  using  our  algorithm  and  plotted  using  Graphviz 3:  longer  lines  imply  larger 

3http : / / www . research . att . com/ sw/tools/ graphviz/ 
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(c)  Distance-test 
graph 


(d)  Node  groups 
and  distances 


(e)  DBLP 
distances 


Figure  5.8:  Outliers  and  group  distances:  Plot  (b)  shows  the  node  groups  found  for  graph  (a). 
Edges  in  the  top-right  block  are  correctly  tagged  as  outliers.  Plot  (d)  shows  the  node  groups  and 
group  distances  for  graph  (c).  Groups  2  and  3  (having  the  most  “bridges”)  are  tagged  as  the  closest 
groups.  Similarly,  groups  1  and  2  are  the  farthest,  because  they  have  no  “bridges”.  Plot  (e)  is  a 
visualization  of  the  distances  between  groups  for  the  DBLP  dataset. 


distances.  The  smallest  group  (Stonebraker,  DeWitt  and  Carey)  is  “Grp8 ,”  and  we  that  it  occupies 
a  central  position  exactly  because  it  has  many  “bridges”  with  people  from  other  groups.  Similarly, 
“Grpl”  is  seen  to  be  far  from  all  other  groups:  this  is  because  authors  in  “Grpl”  have  very  few 
papers  in  the  conferences  included  in  this  dataset,  and  thus  their  connections  to  the  rest  of  the  graph 
are  very  sparse. 


5.5.4  (Q4)  Scalability 

We  have  seen  that  our  algorithms  give  accurate  and  intuitive  results  on  a  variety  of  synthetic  and 
real-world  datasets.  Here,  we  will  verify  their  scalability. 

Figure  5.9  shows  results  on  a  “caveman”  graph  with  three  caves.  We  see  that  the  wall-clock 
time  scales  linearly  with  the  number  of  edges  E.  Thus,  our  algorithms  are  scalable  to  large  graph 
datasets. 
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Figure  5.9:  Scalability:  We  show  timing  results  on  a  “caveman”  graph  with  3  caves.  The  plot 
shows  wall-clock  time  vs.  the  number  of  edges  E  in  the  graph,  for  both  Split  (dashed),  as  well  as 
Shuffle  for  a  particular  (k,  €)  (solid).  We  see  that  both  are  linear  in  E. 


5.6  Details  of  Proofs 


Theorem  2.  After  each  iteration  of  SHUFFLE,  the  code  cost  either  decreases  or  remains  the  same: 

X^Li  i  c(a ij(t))  >  Ya=i  Yfj= i  c(a ij(t  + 1))  >  +  2)). 


Proof  We  shall  only  prove  the  first  inequality,  the  second  inequality  will  follow  by  symmetry 
between  rows  and  columns.  We  will  need  some  notation: 


Piyj  (t)  if  u  =  1 
1  -Pi,j(t)  if  w  =  0 
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where  (a)  follows  from  Step  2  of  the  Cross-association  Algorithm;  (b)  follows  by  re-writing  the 
outer  two  sums-since  i  is  not  used  anywhere  inside  the  [■  ■  ■]  terms;  and  (c)  follows  from  the  non¬ 
negativity  of  the  Kullback-Leibler  distance.  □ 

Theorem  3.  On  each  iteration  of  Split,  the  code  cost  either  decreases  or  remains  the  same:  If 
A  =  [A]A2],  then  C( Afj  +  C( A2)  <  C(A). 


Proof  We  have 


C(A)  =  n(A)H  (PA)  =  „(A)H 


=  H 


PAin(Ai)  +  PAn(A2 


n(A) 


=  n(A1)H(pAJ+n(A2)ff(pAi 
=  C(A0+C(A2) 

where  the  inequality  follows  from  the  concavity  of  H(-)  and  the  fact  that  n(Ai)  +  n(A2)  =  n(A) 
orn(Ai)/n(A)  +  n(A2)/n(A)  =  1.  □ 


n(  A2) 


5.7  Summary 

Our  aims  were  very  broad:  we  wanted  to  find  underlying  structure  in  a  graph.  To  this  end,  we 
introduced  a  novel  approach  and  proposed  a  general,  intuitive  model  founded  on  lossless  compres- 
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sion  and  information-theoretic  principles.  Based  on  this  model,  we  provided  algorithms  for  mining 
large  graph  datasets  in  a  variety  of  ways.  Specifically: 

•  We  proposed  one  of  the  few  methods  for  clustering  nodes  in  a  graph,  that  needs  no  “magic” 
numbers.  Our  algorithms  figure  out  both  the  number  of  clusters  in  the  data,  and  their  mem¬ 
berships. 

•  Besides  being  fully  automatic,  our  approach  satisfies  all  of  the  desired  properties  (P1)-(P5). 
This  includes  scalability  to  large  datasets,  ability  to  function  online  (with  incremental  data 
input),  and  applicability  to  both  self-graphs  and  bipartite  graphs. 

•  In  addition  to  clustering,  we  also  proposed  efficient  methods  to  detect  outlier  edges  and 
to  compute  inter-cluster  “distances.”  Both  of  these  are  important  standalone  data-mining 
problems  in  their  own  right. 

Thus,  we  were  able  to  achieve  all  our  goals  (G1)-(G3).  Experiments  on  a  large  set  of  synthetic  and 
real-world  datasets  show  the  accuracy  and  efficiency  of  our  methods;  not  only  were  we  able  to  find 
known  patterns  in  the  data,  but  we  could  also  detect  unknown  structure,  as  in  the  CLASSIC  dataset. 
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Chapter  6 

The  R-MAT  graph  generator 


“How  can  we  quickly  generate  a  synthetic  yet  realistic  graph?  How 
can  we  spot  fake  graphs  and  outliers?” 

While  we  answered  application-specific  questions  in  the  previous  chapters,  the  focus  of  this 
chapter  is  on  real-world  graphs  in  general.  What  do  real  graphs  look  like?  What  patterns  or  “laws” 
do  they  obey?  This  is  intimately  finked  to  the  problem  of  designing  a  good  graph  generator:  a 
realistic  generator  is  one  which  matches  exactly  these  graph  “laws.”  These  patterns  and  generators 
are  important  for  many  applications: 

•  Detection  of  abnormal  subgraphs/edges/nodes:  Abnormalities  should  deviate  from  the  “nor¬ 
mal”  patterns,  so  understanding  the  patterns  of  naturally  occurring  graphs  is  a  prerequisite 
for  detection  of  such  outliers. 

•  Simulation  studies:  Algorithms  meant  for  large  real-world  graphs  can  be  tested  on  synthetic 
graphs  which  “look  like”  the  original  graphs.  This  is  particularly  useful  if  collecting  the  real 
data  is  hard  or  costly. 

•  Realism  of  samples:  Most  graph  algorithms  are  super-linear  on  the  node  count,  and  thus 
prohibitive  for  large  graphs.  We  might  want  to  build  a  small  sample  graph  that  is  “similar” 
to  a  given  large  graph.  In  other  words,  this  smaller  graph  needs  to  match  the  “patterns”  of 
the  large  graph  to  be  realistic. 

•  Extrapolation  of  data:  Given  a  real,  evolving  graph,  we  expect  it  to  have  x%  more  nodes 
next  year;  how  will  it  then  look  like,  assuming  that  our  “laws”  are  still  obeyed?  For  example, 
in  order  to  test  the  next- generation  Internet  protocol,  we  would  like  to  simulate  it  on  a  graph 
that  is  “similar”  to  what  the  Internet  will  look  like  a  few  years  into  the  future. 

•  Graph  compression:  Graph  patterns  represent  regularities  in  the  data.  Such  regularities  can 
be  used  to  better  compress  the  data. 

Thus,  we  need  to  detect  patterns  in  graphs,  and  then  generate  synthetic  graphs  matching  such 
patterns  automatically. 

There  are  several  desiderata  from  a  graph  generator: 
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(PI)  Realism :  It  should  only  generate  graphs  that  obey  all  (or  at  least  several)  of  the  above  “laws”, 
and  it  would  match  the  properties  of  real  graphs  (degree  exponents,  diameters  etc.,  that  we 
shall  discuss  later)  with  the  appropriate  values  of  its  parameters. 

(P2)  Procedural  generation :  Instead  of  creating  graphs  to  specifically  match  some  patterns,  the 
generator  should  offer  some  process  of  graph  generation,  which  automatically  leads  to  the 
said  patterns.  This  is  necessary  to  gain  insight  into  the  process  of  graph  generation  in  the 
real  world:  if  a  process  cannot  not  generate  synthetic  graphs  with  the  required  patterns,  it  is 
probably  not  the  underlying  process  (or  at  least  the  sole  process)  for  graph  creation  in  the 
real  world. 

(P3)  Parsimony:  It  should  have  a  few  only  parameters. 

(P4)  Fast  parameter-fitting:  Given  any  real-world  graph,  the  model  parameters  should  easily  tun¬ 
able  by  some  published  algorithms  to  generate  graph  similar  to  the  input  graph. 

(P5)  Generation  speed:  It  should  generate  the  graphs  quickly,  ideally,  linearly  on  the  number  of 
nodes  and  edges. 

(P6)  Extensibility:  The  same  method  should  be  able  to  generate  directed,  undirected,  and  bipartite 
graphs,  both  weighted  or  unweighted. 


This  is  exactly  the  main  part  of  this  work.  We  propose  the  Recursive  Matrix  (R-MAT)  model, 
which  naturally  generates  power- law  (or  “DGX”  [  ii  et  al ,  ] )  degree  distributions.  We  show 

that  it  naturally  leads  to  small- world  graphs  and  also  matches  several  other  common  graph  patterns; 
it  is  recursive  (=self- similar),  and  it  has  only  a  small  number  of  parameters. 

The  rest  of  this  chapter  is  organized  as  follows:  Section  6.1  surveys  the  existing  graph  laws  and 
generators.  Section  6.2  presents  the  idea  behind  our  R-MAT  method.  We  discuss  the  properties 
of  R-MAT  graphs,  and  algorithms  for  graph  generation  and  parameter-fitting  under  R-MAT.  Sec¬ 
tion  6.3  gives  the  experimental  results,  where  we  show  that  R-MAT  successfully  mimics  large  real 
graphs.  Section  6.4  provides  details  of  proofs.  A  summary  is  provided  in  Section  6.5. 


6.1  Related  Work 

The  twin  problems  of  finding  graph  patterns  and  building  graph  generators  have  attracted  a  lot  of 
recent  interest,  and  a  large  body  of  work  has  been  done  on  both,  not  only  by  computer  scientists, 
but  also  physicists,  mathematicians,  sociologists,  and  others.  However,  there  is  little  interaction 
among  these  fields,  with  the  result  that  they  often  use  different  terminology  and  do  not  benefit  from 
each  other’s  advances.  In  this  section,  we  attempt  to  give  a  brief  overview  of  the  main  ideas,  with 
a  focus  on  combining  sources  from  all  the  different  fields,  to  gain  a  coherent  picture  of  the  current 
state-of-the-art.  The  interested  reader  is  also  referred  to  some  excellent  and  entertaining  books  on 
the  topic  [Barabasi,  2002,  Watts,  2003]. 
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6.1.1  Graph  Patterns  and  “Laws” 


While  there  are  many  differences  between  graphs,  some  patterns  show  up  regularly.  Work  has 
focused  on  finding  several  such  patterns,  which  together  characterize  naturally  occurring  graphs. 
The  main  ones  appear  to  be: 

•  Power  laws  (already  seen  in  Chapter  2), 

•  Small  diameters,  and 

•  Community  effects. 


Power  laws: 

While  the  Gaussian  distribution  is  common  in  nature,  there  are  many  cases  where  the  probability 
of  events  far  to  the  right  of  the  mean  is  significantly  higher  than  in  Gaussians.  In  the  Internet, 
for  example,  most  routers  have  a  very  low  degree  (perhaps  “home”  routers),  while  a  few  routers 
have  extremely  high  degree  (perhaps  the  “core”  routers  of  the  Internet  backbone)  [  Tiloutsos  et  al., 
1999].  Power-law  distributions  attempt  to  model  this.  Some  of  the  following  were  defined  in 
Chapter  2;  we  repeat  the  definitions  here  for  completeness. 

Definition  16  (Power  Law).  Two  variables  x  and  y  are  related  by  a  power  law  when  their  scatter 
plot  is  linear  on  a  log-log  scale: 

y(x )  =  c  •  x ~7  (6.1) 

where  c  and  7  are  positive  constants.  The  constant  7  is  often  called  the  power  law  exponent. 

Definition  17  (Degree  Distribution).  The  degree  distribution  of  an  undirected  graph  is  a  plot  of 
the  count  Ck  of  nodes  with  degree  k,  versus  the  degree  k,  typically  on  a  log-log  scale.  Occasionally, 
the  fraction  f  is  used  instead  of  <77  however,  this  merely  translates  the  log-log  plot  downwards. 
For  directed  graphs,  outdegree  and  indegree  distributions  are  defined  similarly. 

Definition  18  (Scree  Plot).  This  is  a  plot  of  the  eigenvalues  (or  singular  values)  of  the  adjacency 
matrix  of  the  graph,  versus  their  rank,  using  a  log-log  scale. 

Both  degree  distributions  and  scree  plots  of  many  real-world  graphs  have  been  found  to  obey 
power  laws.  Examples  include  the  Internet  AS  and  router  graphs  [  kiloutsos  et  al.,  1999,  Govin- 
dan  and  Tangmunarunkit,  2000],  the  World-wide  Web  [Barabasi  and  Albert,  1999,  Kumar  et  al., 
1999,  Broder  et  al.,  2000,  Kleinberg  et  al.,  1999],  citation  graphs  [Redner,  1998],  online  social  net¬ 
works  [  hakrabarti  et  al.,  ],  and  many  others.  Power  laws  also  show  up  in  the  distribution  of 
“bipartite  cores”  (~  communities)  and  the  distribution  of  PageRank  values  [Brin  and  Page,  1998, 
Pandurangan  et  al.,  i  ].  Indeed,  power  laws  appear  to  be  a  defining  characteristic  of  almost  all 
large  real-world  graphs. 

Significance  of  power  laws:  The  significance  of  power  law  distributions  lies  in  the  fact  that  they 
are  heavy-tailed ,  meaning  that  they  decay  more  slowly  than  exponential  or  Gaussian  distributions. 
Thus,  a  power  law  degree  distribution  would  be  much  more  likely  to  have  nodes  with  a  very  high 
degree  (much  larger  than  the  mean)  than  the  other  two  distributions. 
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Deviations  from  power  laws:  Pennock  et  al.  [Pennock  et  al.,  2002]  and  others  have  observed  devi¬ 
ations  from  a  pure  power  law  distribution  in  several  datasets.  Two  of  the  more  common  deviations 
are  exponential  cutoffs  and  lognormals.  The  exponential  cutoff  models  distributions  which  look 
like  power  laws  over  the  lower  range  of  values  on  the  x-axis,  but  decay  exponentially  quickly  for 
higher  values;  examples  include  the  network  of  airports  [Amaral  et  al ,  2000].  Lognormal  distri¬ 
butions  look  like  truncated  parabolas  on  log-log  scales,  and  model  situations  where  the  plot  “dips” 
downwards  in  the  lower  range  of  values  on  the  x-axis;  examples  include  degree  distributions  of 
subsets  of  the  WWW,  and  many  others  [Pennock  et  al.,  2002,  Bi  et  al.,  2001]. 


Small  diameters: 

Several  definitions  of  the  term  “graph  diameter”  exist:  the  definition  we  use  is  called  the  “effective 
diameter”  or  “eccentricity,”  and  is  closely  related  to  the  “hop-plot”  of  a  graph;  both  of  these  are 
defined  below.  The  advantages  are  twofold:  (a)  the  “hop-plot”  and  “effective  diameter”  can  be 
computed  in  linear  time  and  space  using  a  randomized  algorithm  [Palmer  et  al.,  ZOO  ],  and  (b)  this 
particular  definition  is  robust  in  the  presence  of  outliers. 

Definition  19  (Hop-plot).  Starting  from  a  node  u  in  the  graph,  we  find  the  number  of  nodes  Nh(u ) 
in  a  neighborhood  of  h  hops.  We  repeat  this  starting  from  each  node  in  the  graph,  and  sum  the 
results  to  find  the  total  neighborhood  size  Nhfor  h  hops  (Nh  —  V(j  Nh(u)).  The  hop-plot  is  just 
the  plot  of  Nh  versus  h. 

Definition  20  (Effective  diameter).  This  is  the  minimum  number  of  hops  in  which  some  fraction 
(say,  90%)  of  all  connected  pairs  of  nodes  can  reach  each  other  [Tauro  et  a  ,  2001 ]. 

Significance  of  graph  diameter:  The  diameters  of  many  real-world  graphs  are  very  small  compared 
to  the  graph  size  [Albert  and  Barabasi,  2002]:  only  around  4  for  the  Internet  AS-level  graph,  12 
for  the  Internet  Router-level  graph,  16  for  the  WWW,  and  the  famous  “six  degrees  of  separation” 
in  the  social  network.  Any  realistic  graph  generator  needs  to  match  this  criterion. 


Community  Effects: 

Informally,  a  community  is  a  set  of  nodes  where  each  node  is  “closer”  to  the  other  nodes  within 
the  community  than  to  nodes  outside  it.  This  effect  has  been  found  (or  is  believed  to  exist)  in  many 
real-world  graphs,  especially  social  networks  [Moody,  2001,  Schwartz  and  Wood,  199  ]. 

Community  effects  have  typically  been  studied  in  two  contexts:  (a)  local  one-hop  neighbor¬ 
hoods,  as  characterized  by  the  clustering  coefficient,  and  (b)  node  groups  with  possibly  longer 
paths  between  members,  such  as  graph  partitions  and  bipartite  cores.  All  of  these  are  discussed 
below. 

Definition  21  (Clustering  Coefficient).  For  a  node  v  with  edges  ( u ,  v)  and  (v.  w),  the  clustering 
coefficient  of  v  measures  the  probability  of  existence  of  the  third  edge  ( u,w )  (Figure  6.1(a)).  The 
clustering  coefficient  of  the  entire  graph  is  found  by  averaging  over  all  nodes  in  the  graph1. 

'We  note  here  that  there  is  at  least  one  other  definition  based  on  counting  triangles  in  the  graph;  both  definitions 
make  sense,  but  we  use  this  one. 
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Set  L 


Set  R 


(a)  Clustering  coefficient  (b)  Bipartite  core 

Figure  6.1:  Indicators  of  community  structure:  (a)  Node  X  has  6  neighbors.  These  neighbors 
could  have  been  connected  by  (®)  =  15  edges,  but  only  5  such  edges  exist.  So,  the  local  clustering 
coefficient  of  node  A"  is  5/15  =  1/3.  (b)  A  4  x  3  bipartite  core,  with  each  node  in  Set  L  being 
connected  to  each  node  in  Set  R. 


Definition  22  (Graph  Partitioning).  Graph  partitioning  techniques  typically  break  the  graph  into 
two  disjoint  partitions  (or  communities)  while  optimizing  some  measure;  these  two  communities 
may  then  be  repartitioned  separately. 

The  popular  METIS  software  tries  to  find  the  best  separator,  minimizing  the  number  of  edges  cut 
in  order  to  form  two  disconnected  components  of  relatively  similar  sizes  [  larypis  and  Kumar, 

1998a],  Many  other  measures  and  techniques  exist  [Brandes  et  al.,  2003,  Alon,  1998].  Our  Cross¬ 
associations  method  also  finds  such  partitions,  while  optimizing  the  cost  of  encoding  the  adjacency 
matrix  of  the  graph,  as  we  have  seen  in  Chapter  5. 

Definition  23  (Bipartite  Core).  A  bipartite  core  in  a  graph  consists  of  two  (not  necessarily  dis¬ 
joint)  sets  of  nodes  L  and  R  such  that  every  node  in  L  links  to  every  node  in  R;  links  from  R  to  L 
are  optional  (Figure  6.1(b)). 

Significance  of  graph  communities:  Most  real-world  graphs  exhibit  strong  community  effects.  Moody  [  /loody, 
]  found  groupings  based  on  race  and  age  in  a  network  of  friendships  in  one  American  school, 

Schwartz  and  Wood  [  chwartz  and  Wood,  1993]  group  people  with  shared  interests  from  email 
logs,  Borgs  et  al.  [Borgs  et  al ,  2004]  find  communities  from  “cross-posts”  on  Usenet,  and  Flake  et  al.  [  lake 
et  al.,  2000]  discover  communities  of  webpages  in  the  WWW.  This  is  also  reflected  in  the  cluster¬ 
ing  coefficients  of  real-world  graphs:  they  are  almost  always  much  larger  than  in  random  graphs 
of  the  same  size  [Watts  and  Strogatz,  1998]. 


Other  Patterns: 

Many  other  graph  patterns  have  also  been  studied  in  the  literature.  We  mention  two  of  these  that 
we  will  use  later:  (a)  Edge-betweenness  or  Stress,  (b)  “Singular  vector  value”  versus  rank  plots, 
and  Resilience  under  attack. 

Definition  24  (Edge  betweenness  or  Stress).  Consider  all  shortest  paths  between  all  pairs  of 
nodes  in  a  graph.  The  edge-betweenness  or  stress  of  an  edge  is  the  number  of  these  shortest  paths 
that  the  edge  belongs  to,  and  is  thus  a  measure  of  the  “load”  on  that  edge. 
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Definition  25  (Stress  Plot).  This  is  a  plot  of  the  number  of  nodes  Sk  with  stress  k,  versus  k. 

Definition  26  (“Singular  vector  value”  versus  rank  plots).  The  “singular  vector  value  ”  of  a 
node  is  the  absolute  value  of  the  corresponding  component  of  the  first  singular  vector  of  the  graph. 
It  can  be  considered  to  be  a  measure  of  the  “importance  ”  of  the  node,  and  as  we  will  see  later, 
is  closely  related  to  the  widely-used  concept  of  “Bonacich  centrality”  in  social  network  analy¬ 
sis  [Bonacich,  198.  ]. 

Definition  27  (Resilience).  A  resilient  graph  is  one  whose  diameter  does  not  increase  when  its 
nodes  or  edges  are  removed  according  to  some  “attack”  process  [Palmer  et  al,  2002,  Albert 
et  al,  2000],  Most  real-world  graphs  are  very  resilient  against  random  failures,  but  susceptible  to 
targeted  attacks  (such  as  removed  of  nodes  of  the  highest  degree)  [  hngmunarunkit  et  al.,  2001 ]. 


6.1.2  Graph  Generators 

Graph  generators  allow  us  to  create  synthetic  graphs,  which  can  then  be  used  for,  say,  simulation 
studies.  However,  to  be  realistic,  the  generated  graph  must  match  all  (or  at  least  several)  of  the 
patterns  mentioned  above.  By  telling  us  which  processes  can  (or  cannot)  lead  to  the  development 
of  these  patterns,  graph  generators  can  provide  insight  into  the  creation  of  real-world  graphs. 

Graph  models  and  generators  can  be  broadly  classified  into  four  categories: 

1.  Random  graph  models:  The  graphs  are  generated  by  a  random  process.  The  basic  random 
graph  model  has  attracted  a  lot  of  research  interest  due  to  its  phase  transition  properties. 

2.  Preferential  attachment  models:  In  these  models,  the  “rich”  get  “richer”  as  the  network 
grows,  leading  to  power  law  effects.  Some  of  today’s  most  popular  models  belong  to  this 
class. 

3.  Optimization-based  models:  Here,  power  laws  are  shown  to  evolve  when  risks  are  mini¬ 
mized  using  limited  resources.  Together  with  the  preferential  attachment  models,  they  try  to 
provide  mechanisms  that  automatically  lead  to  power  laws. 

4.  Geographical  models:  These  models  consider  the  effects  of  geography  (i.e.,  the  positions 
of  the  nodes)  on  the  growth  and  topology  of  the  network.  This  is  especially  important  for 
modeling  router  or  power-grid  networks,  which  involve  laying  wires  between  points  on  the 
globe. 

We  will  briefly  touch  upon  some  of  these  generators  below.  Tables  6.1  and  6.2  provide  a 
taxonomy  of  generators. 


Random  graph  models: 

In  the  earliest  random  graph  model  [Erdos  and  Renyi,  1960],  we  start  with  N  nodes,  and  for  every 
pair  of  nodes,  an  edge  is  added  between  them  with  probability  p.  This  simple  model  leads  to  a 
surprising  list  of  properties,  including  phase  transitions  in  the  size  of  the  largest  component,  and 
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Generator 

Graph  type 

Degree  distributions 

Power  law 

Exponen¬ 

tial 

Undir. 

Dir. 

Bip. 

Self 

loops 

Mult. 

edges 

Geog. 

info 

Plain 

Exp. 

cutoff 

Devia¬ 

tion 

Erdos-Renyi  [Erdos  and  Renyi,  1960] 

yj 

yj 

yj 

yj 

PLRG  [Aiello  et  al.,  2000], 

PLOD  [Palmer  and  Steffan,  2000] 

y/ 

yj 

yj 

yj 

Exponential  cutoff 

[Newman  et  al.,  2001] 

V 

yj 

BA  [Barabasi  and  Albert,  1!  19] 

yj 

yj 

(7  =  3) 

AB  [Albert  and  Barabasi,  2000] 

y/ 

yj 

yj 

y/ 

yj 

Edge  Copying  [Kleinberg  et  al.,  1999], 
[Kumar  et  al.,  1999] 

yj 

yj 

yj 

GLP  [Bu  and  Towsley,  2002] 

y/ 

yj 

yj 

yj 

Accelerated  growth 

[Barabasi  et  al.,  2002] 

y/ 

yj 

yj 

Power-law  mixture  of 

7  =  2  and  7  =  3 

Fitness  model 

[Bianconi  and  Barabasi,  2001] 

yj 

yj 

(modified) 

Aiello  et  al.  [Aiello  et  al.,  2001] 

yj 

yj 

Pandurangan  et  al.  [Pandurangan  et  al.,  2002] 

yj 

yj 

yj 

yj 

Inet  [Winick  and  Jamin,  2002] 

V 

y/ 

Pennock  et  al.  [Pennock  et  al.,  2002] 

V 

yj 

yj 

y/ 

yj 

Small-world 

[Watts  and  Strogatz,  1998] 

yj 

yj 

yj 

Waxman  [Waxman,  1988] 

yj 

yj 

yj 

BRITE  [Medina  et  al.,  2000] 

y/ 

yj 

Yook  et  al.  [Yook  et  al.,  2002] 

V 

y/ 

yj 

yj 

Fabrikant  et  al.  [  al.,  2(  12] 

y/ 

yj 

yj 

R-MAT  [Chakrabarti  et  al.,  2004] 

y/ 

yj 

yj 

yj 

yj 

yj 

yj 

(DGX) 

Table  6.1:  Taxonomy  of  graph  generators:  This  table  shows  the  graph  types  and  degree  distri¬ 
butions  that  different  graph  generators  can  create.  The  graph  type  can  be  undirected,  directed, 
bipartite,  allowing  self-loops  or  multi-graph  (multiple  edges  possible  between  nodes).  The  degree 
distributions  can  be  power-law  (with  possible  exponential  cutoffs,  or  other  deviations  such  as  log- 
normal/DGX)  or  exponential  decay.  Empty  cells  indicate  that  the  corresponding  property  does  not 
occur  in  the  corresponding  model. 
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Generator 

Diameter  or 

Avg  path  len. 

Community 

Clustering 

coefficient 

Remarks 

Bip.  core 
vs  size 

C(k)  vi  k 

Erdos-Renyi  [  lyi,  1(  50] 

0(log  N) 

Indep. 

Low,  CC  oc  N~ 1 

PLRG  [Aiello  et  al.,  2000], 

PLOD  [Palmer  and  Steffan,  2000] 

OQogiV) 

Indep. 

CC-»  0 

for  large  N 

Exponential  cutoff 

[Newman  et  al.,  2001] 

0(log  N) 

CC^  0 
for  large  N 

BA  [Barabasi  and  Albert,  1!  J9] 

O  (log  TV)  or 

Of  logN  ) 

CC  tx  7V-U  Yb 

AB  [Albert  and  Barabasi,  2000] 

Edge  copying  [Kleinberg  et  al.,  1999], 
[Kumar  et  al.,  1999] 

Power-law 

GLP  [Bu  and  Towsley,  2002] 

Higher  than 
AB,  BA,  PLRG 

Internet 

only 

Accelerated  growth 

[Barabasi  et  al.,  2002] 

Non-monotonic 

with  N 

Fitness  model 

[Bianconi  and  Barabasi,  2001] 

Aiello  et  al.  [Aiello  et  al.,  2001] 

Pandurangan  et  al.  [Pandurangan  et  al.,  2002] 

Inet  [Winick  and  Jamin,  2002] 

Specific  to 
the  AS  graph 

Pennock  et  al.  [Pennock  et  al.,  2002] 

Small-world 

[Watts  and  Strogatz,  1998] 

O(N)  for  small  N, 
G(ln  N)  for  large  N, 
depends  on  p 

CC{p)  oc 
(1  -P)3, 

Indep  of  N 

A’=num  nodes 
p=rewiring  prob 

Waxman  [Waxman,  1988] 

BRITE  [Medina  et  al.,  2000] 

Low  (like  in  BA) 

like  in  BA 

BA  +  Waxman 

with  additions 

Yook  et  al.  [Yook  et  al.,  2002] 

Fabrikant  et  al.  [Fabrikant  et  al.,  2002] 

Tree,  density  1 

R-MAT  [Chakrabarti  et  al.,  2004] 

Low  (empirically) 

Table  6.2:  Taxonomy  of  graph  generators  (Contd.):  The  comparisons  are  made  for  graph  diameter, 
existence  of  community  structure  (number  of  bipartite  cores  versus  core  size,  or  Clustering  coef¬ 
ficient  CC(k)  of  all  nodes  with  degree  k  versus  k ),  and  clustering  coefficient.  N  is  the  number 
of  nodes  in  the  graph.  The  empty  cells  represent  information  unknown  to  the  authors,  and  require 
further  research. 
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diameter  logarithmic  in  graph  size.  Its  ease  of  analysis  has  proven  to  be  very  useful  in  the  early 
development  of  the  field.  However,  its  degree  distribution  is  Poisson,  whereas  most  real-world 
graphs  seem  to  exhibit  power  law  distributions.  Also,  the  graphs  lack  community  effects:  the 
clustering  coefficient  is  usually  far  smaller  than  that  in  comparable  real-world  graphs. 

The  basic  Erdos-Renyi  model  has  been  extended  in  several  ways,  typically  to  match  the  power 
law  degree  distribution  pattern  [Aiello  et  al.,  2000,  Palmer  and  Steffan,  2000,  Newman  et  al., 

2001].  These  methods  retain  the  simplicity  and  ease  of  analysis  of  the  original  model,  while 
removing  one  of  its  weaknesses:  the  unrealistic  degree  distribution.  However,  these  models  do  not 
describe  any  process  by  which  power  laws  may  arise  automatically,  which  makes  them  less  useful 
in  understanding  the  internal  processes  behind  graph  formation  in  the  real  world  (property  (P2)). 
Also,  most  models  make  no  attempt  to  match  any  other  patterns  (property  (PI)),  and  further  work 
is  needed  to  incorporate  community  effects  into  the  model. 


Preferential  attachment  models: 

First  developed  in  the  mid-1950s  [Simon,  1955],  the  idea  of  preferential  attachment  has  been  re¬ 
discovered  recently  due  to  their  ability  to  generate  skewed  distributions  such  as  power  laws  [Price, 
1976,  Barabasi  and  Albert,  1999].  Informally,  they  use  the  concept  of  the  “rich  getting  richer”  over 
time:  new  nodes  join  the  graph  each  timestep,  and  preferentially  connect  to  existing  nodes  with 
high  degree.  This  basic  idea  has  been  very  influential,  and  has  formed  the  basis  of  a  large  body  of 
further  work  [Albert  and  Barabasi,  2000,  Bu  and  Towsley,  2002,  Barabasi  et  al.,  2002,  Bianconi 
and  Barabasi,  2001,  Aiello  et  al.,  2001,  Bollobas  et  al.,  2003,  Pandurangan  et  al.,  2002,  Winick  and 
Jamin,  2002,  Pennock  et  al.,  2002,  Albert  and  Barabasi,  2002], 

The  preferential  attachment  models  have  several  interesting  properties: 

•  Power  law  degree  distributions:  These  models  lead  to  power  laws  as  a  by-product  of  the 
graph  generation  method,  and  not  as  a  specific  designed-in  feature. 

•  Low  diameter:  The  generated  graphs  have  O  (log  N)  diameter.  Thus,  the  increase  in  diame¬ 
ter  is  slower  than  the  growth  in  graph  size. 

•  Resilience:  The  generated  graphs  are  resilient  against  random  node/edge  removals,  but 
quickly  become  disconnected  when  nodes  are  removed  in  descending  order  of  degree  [Albert 
et  al.,  2000,  Palmer  et  al.,  2002],  This  matches  the  behavior  of  the  Internet. 

•  A  procedural  method:  Perhaps  most  importantly,  these  models  describe  a  process  that  can 
lead  to  realistic  graphs  and  power  laws  (matching  property  (P2)).  The  two  main  ideas  are 
those  of  (a)  growth,  and  (b)  preferential  attachment;  variants  to  the  basic  model  add  other 
ideas  such  as  node  “fitness.”  The  ability  of  these  models  to  match  many  real-world  graphs 
implies  that  graphs  in  the  real-world  might  indeed  have  been  generated  by  similar  processes. 

Preferential  attachment  models  are  probably  the  most  popular  models  currently,  due  to  their 
ability  to  match  power  law  degree  distributions  by  such  a  simple  set  of  steps.  However,  they 
typically  do  not  exhibit  community  effects,  and,  apart  from  the  paper  of  Pennock  et  al.  [  Pennock 
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et  al.,  2002],  little  effort  has  gone  into  finding  reasons  for  deviations  from  power  laws  in  real-world 
graphs  (property  (PI)). 

One  set  of  related  models  has  shown  promise  recently:  these  are  the  “edge  copying”  mod¬ 
els  [  deinberg  et  al ,  1999,  Kumar  et  al ,  1999],  where  a  node  (such  as  a  website)  acquires  edges  by 
copying  links  from  other  nodes  (websites).  This  is  similar  to  preferential  attachment  because  pages 
with  high-degree  will  be  linked  to  by  many  other  pages,  and  so  have  a  greater  chance  of  getting 
copied.  However,  such  graphs  can  be  expected  to  have  a  large  number  of  bipartite  cores  (which 
leads  to  the  community  effect).  This  makes  the  edge-copying  technique  a  promising  research  di¬ 
rection. 


Geographical  models: 

Several  models  introduce  the  constraints  of  geography  into  network  formation.  For  example,  it  is 
easier  (cheaper)  to  link  two  routers  which  are  physically  close  to  each  other;  most  of  our  social 
contacts  are  people  we  meet  often,  and  who  consequently  probably  live  close  to  us  (say,  in  the  same 
town  or  city),  and  so  on.  In  this  area,  there  are  two  important  models:  the  Small-World  model,  and 
the  Waxman  model. 

The  Small-World  model  [Watts  and  Strogatz,  1998]  starts  with  a  regular  lattice  and  adds/rewires 
some  edges  randomly.  The  original  lattice  represents  ties  between  close  friends,  while  the  random 
edges  link  “acquaintances,”  and  serve  to  connect  different  social  circles.  Thus,  the  resulting  graph 
has  low  diameter  but  a  high  clustering  coefficient  —  two  patterns  common  to  real-world  graphs. 
However,  the  degree  distribution  is  not  a  power  law,  and  the  basic  model  needs  extensions  to  match 
this  (property  (PI)). 

The  Waxman  model  [Waxman,  1988]  probabilistically  links  two  nodes  in  the  graph,  based  on 
their  geographical  distance  (in  fact,  the  probability  decreases  exponentially  with  distance).  The 
model  is  simple  yet  attractive,  and  has  been  incorporated  into  the  popular  BRITE  [  /ledina  et  a  , 
]  generator  used  for  graph  generation  in  the  networking  community.  However,  it  does  not  yield 
a  power  law  degree  distribution,  and  further  work  is  needed  to  analyze  the  other  graph  patterns  for 
this  generator  (property  (PI)). 


Optimization-based  models: 

Optimization-based  models  provide  another  process  leading  to  power  laws.  Carlson  and  Doyle  [  Tail- 
son  and  Doyle,  1999,  Doyle  and  Carlson,  2000]  propose  in  their  Highly  Optimized  Tolerance 
(HOT)  model  that  power  laws  may  arise  in  systems  due  to  tradeoff's  between  yield  (or  profit), 
resources  (to  prevent  a  risk  from  causing  damage)  and  tolerance  to  risks.  Apart  from  power  laws, 
HOT  also  matches  the  resilience  pattern  of  many  real-world  graphs  (see  Definition  27):  it  is  robust 
against  “designed-for”  uncertainties,  but  very  sensitive  to  design  flaws  and  unanticipated  perturba¬ 
tions. 

Several  variations  of  the  basic  model  have  also  been  proposed:  COLD  [  'Jewman  et  a  ,  ] 

truncates  the  power  law  tails,  while  the  Heuristically  Optimized  Tradeoffs  model  [  hbrikant  et  a  , 

]  needs  only  locally-optimal  decisions  instead  of  global  optimality. 
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Figure  6.2:  The  R-MAT  model:  The  adjacency  matrix  is  broken  into  four  equal-sized  partitions, 
and  one  of  those  four  is  chosen  according  to  a  (possibly  non-uniform)  probability  distribution.  This 
partition  is  then  split  recursively  till  we  reach  a  single  cell,  where  an  edge  is  placed.  Multiple  such 
edge  placements  are  used  to  generate  the  full  synthetic  graph. 


Optimization-based  models  provide  a  very  intuitive  process  which  leads  to  both  power  laws  and 
resilience  (matching  property  (P2)).  However,  further  research  needs  to  be  conducted  on  other  pat¬ 
terns  for  such  graphs  (property  (PI)).  One  step  in  this  direction  is  the  work  of  Berger  et  al.  [Berger 
et  al ,  2005],  who  generalize  the  Heuristically  Optimized  Tradeoffs  model,  and  show  that  it  is 
equivalent  to  a  form  of  preferential  attachment;  thus,  competition  between  opposing  forces  can 
give  rise  to  preferential  attachment,  and  we  already  know  that  preferential  attachment  can,  in  turn, 
lead  to  power  laws  and  exponential  cutoffs. 


Summary  of  graph  generators 

Thus,  we  see  that  these  are  four  general  categories  of  graph  generators,  along  with  several  “cross¬ 
category”  generators  too.  Most  of  the  generators  do  well  with  properties  (P3)-(P6),  but  need 
further  research  to  determine  their  realism  (property  (PI),  that  is,  which  patterns  they  match,  and 
which  they  don’t).  Next,  we  will  discuss  our  R-MAT  graph  generator,  which  attempts  to  match 
all  of  these  properties,  including  the  degree  distribution  (with  both  power  laws  and  deviations), 
community  structure,  singular  vector  value  versus  rank  patterns  and  so  on. 


6.2  The  R-MAT  methodology 

We  have  seen  that  most  of  the  current  graph  generators  focus  on  only  one  graph  pattern  -  typically 
the  degree  distribution  -  and  give  low  importance  to  all  the  others.  There  is  also  the  question  of 
how  to  fit  model  parameters  to  match  a  given  graph.  What  we  would  like  is  a  tradeoff  between 
parsimony  (property  (P3)),  realism  (property  (PI)),  and  efficiency  (properties  (P4)  and  (P5)).  In 
this  section,  we  present  the  R-MAT  generator,  which  attempts  to  address  all  of  these  concerns. 

Description  and  properties: 

R-MAT  is  based  on  the  well-known  “80  —  20  rule”  in  one  dimension  (80%  of  the  data  falls 
on  20%  of  the  data  range),  which  is  known  to  result  in  self- similarity  and  power  laws;  R-MAT 
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extends  this  rule  to  a  two-dimensional  graph  adjacency  matrix.  The  R-MAT  generator  creates 
directed  graphs  with  2n  nodes  and  E  edges,  where  both  values  are  provided  by  the  user.  We  start 
with  an  empty  adjacency  matrix,  and  divide  it  into  four  equal-sized  partitions.  One  of  the  four 
partitions  is  chosen  with  probabilities  a,  b,  c,  d  respectively  (a  +  b  +  c  +  d  =  1),  as  in  Figure  6.2. 
The  chosen  partition  is  again  subdivided  into  four  smaller  partitions,  and  the  procedure  is  repeated 
until  we  reach  a  simple  cell  (=1  x  1  partition).  The  nodes  (that  is,  row  and  column)  corresponding 
to  this  cell  are  linked  by  an  edge  in  the  graph.  This  process  is  repeated  E  times  to  generate  the  full 
graph.  There  is  a  subtle  point  here:  we  may  have  duplicate  edges  (i.e.,  edges  which  fall  into  the 
same  cell  in  the  adjacency  matrix),  but  we  only  keep  one  of  them  when  generating  an  unweighted 
graph.  To  smooth  out  fluctuations  in  the  degree  distributions,  some  noise  is  added  to  the  (a,  6,  c,  d) 
values  at  each  stage  of  the  recursion,  followed  by  renormalization  (so  that  a  +  b  +  c  +  d  =  1). 
Typically,  a  >  b,  a  >  c,  a  >  d. 

Parsimony:  The  algorithm  needs  only  three  parameters:  the  partition  probabilities  a,  b,  and  c; 
d  =  1  —  a  —  b  —  c.  Thus,  the  models  is  parsimonious. 

Degree  distribution:  The  following  theorem  gives  the  expected  degree  distribution  of  an  R-MAT 
generated  graph. 

Theorem  4  (Count- vs-degree).  For  a  pure  R-MAT  generated  graph  (ie.,  without  any  smoothing 
factors ),  the  expected  number  of  nodes  Ck  with  outdegree  k  is  given  by 

Ck=  (f)E  (")  [p“-‘(l-p)T*  ‘  (6-2) 

where  2n  is  the  number  of  nodes  in  the  R-MAT  graph  (typically  n  =  |~log2  N ]  and  p  =  a  +  b. 

Proof  Proved  in  Section  6.4.  £3 

This  is  well  modeled  by  a  discrete  lognormal  [Bi  et  al.,  '  ],  which  looks  like  a  truncated 

parabola  on  the  log-log  scale.  By  setting  the  parameters  properly,  this  can  successfully  match  both 
power-law  and  “unimodal”  distributions  [Pennock  et  al.,  ]. 

Communities:  Intuitively,  R-MAT  is  generating  “communities”  in  the  graph: 

•  The  partitions  a  and  d  represent  separate  groups  of  nodes  which  correspond  to  communities 
(say,  “Linux”  and  “Windows”  users). 

•  The  partitions  b  and  c  are  the  cross-links  between  these  two  groups;  edges  there  would  denote 
friends  with  separate  preferences. 

•  The  recursive  nature  of  the  partitions  means  that  we  automatically  get  sub-communities 
within  existing  communities  (say,  “RedHat”  and  “Mandrake”  enthusiasts  within  the  “Linux” 
group). 

Diameter,  singular  values  and  other  properties:  We  show  experimentally  that  graphs  generated  by 
R-MAT  have  small  diameter  and  match  several  other  criteria  as  well. 
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Extensions  to  undirected,  bipartite  and  weighted  graphs:  The  basic  model  generates  directed  graphs; 
all  the  other  types  of  graphs  can  be  easily  generated  by  minor  modifications  of  the  model.  For  undi¬ 
rected  graphs,  a  directed  graph  is  generated  and  then  made  symmetric.  For  bipartite  graphs,  the 
same  approach  is  used;  the  only  difference  is  that  the  adjacency  matrix  is  now  rectangular  instead 
of  square.  For  weighted  graphs,  the  number  of  duplicate  edges  in  each  cell  of  the  adjacency  matrix 
is  taken  to  be  the  weight  of  that  edge.  More  details  may  be  found  in  [  Ihakrabarti  et  al.,  2004]. 


Parameter  fitting  algorithm:  We  are  given  some  input  graph,  and  need  to  fit  the  R-MAT  model 
parameters  so  that  the  generated  graph  matches  the  input  graph  in  terms  of  graph  patterns.  Using 
Theorem  4  we  can  fit  the  indegree  and  outdegree  distributions;  this  gives  us  two  equations  (specif¬ 
ically,  we  get  the  values  of  p  =  a  +  b  and  q  =  a  +  c).  We  need  one  more  equation  to  fit  the  three 
model  parameters. 

We  tried  several  experiments  where  we  fit  the  scree  plot  (see  Definition  18).  However,  we 
obtained  comparable  (and  much  faster)  results  by  conjecturing  that  the  a  :  b  and  a  :  c  ratios  are 
approximately  75  :  25  (as  seen  in  many  real  world  scenarios),  and  using  these  to  fit  the  parameters. 
Hence,  this  is  our  current  parameter-fitting  method  for  R-MAT. 


Relationship  with  Cross-Associations:  In  Chapter  5,  we  described  our  Cross- Associations  algo¬ 
rithm  to  extract  “natural”  communities  in  graphs,  and  R-MAT  specifically  generates  graphs  with  a 
recursive  community  structure.  The  similarity  between  these  two,  however,  is  not  very  deep.  While 
R-MAT  creates  a  community  “hierarchy,”  Cross-Associations  finds  only  a  flat  set  of  communities, 
which  need  not  strictly  correspond  to  the  R-MAT  communities  at  any  level  of  the  hierarchy.  The 
real  linkage  is  in  the  sizes  of  the  communities  found:  when  the  R-MAT  parameters  are  very  skewed 
(say,  a  d),  the  communities  found  by  Cross-Associations  have  very  skewed  sizes,  with  several 
very  large  sparse  communities  and  a  few  extremely  small  densely-connected  communities.  Con¬ 
versely,  when  the  R-MAT  parameters  are  more  equal,  the  community  sizes  are  more  similar  to 
each  other. 


Relationship  with  tree-based  generators:  The  process  of  placing  edges  during  R-MAT  graph  gen¬ 
eration  can  be  thought  of  as  a  tree  traversal  from  root  to  leaf:  each  internal  node  of  this  tree  has 
four  children  (one  for  each  “quadrant”  in  R-MAT),  the  leaves  of  the  tree  are  the  individual  cells  of 
the  adjacency  matrix,  and  each  “choosing  of  a  quadrant”  in  R-MAT  corresponds  to  a  descent  of 
one  level  in  this  tree.  This  is  similar  in  spirit  to  other  tree-based  graph  generators  [  leinberg,  100  , 
Watts  et  al.,  2002]  which  use  hierarchies  in  forming  links  between  nodes;  however,  the  differences 
are  significant.  The  R-MAT  tree  has  all  possible  edges  as  its  leaves,  while  the  hierarchy-based 
generators  have  nodes  as  leaves.  The  latter  use  a  tree-distance  metric  to  compute  the  probability 
of  an  edge,  but  the  usefulness  of  tree-distance  in  the  R-MAT  tree  is  not  clear.  (What  does  the  tree- 
distance  between  two  edges  mean?)  The  R-MAT  tree  automatically  generates  the  realistic  degree 
distribution  described  above,  while  for  the  other  tree-based  methods,  the  degree  distribution  also 
depends  on  how  many  edges  we  attach  to  each  node;  hence,  a  large  number  of  free  parameters 
need  to  be  set.  Thus,  while  R-MAT  is  conceptually  related  to  these  tree-based  methods,  it  is  also 
significantly  different. 
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6.3  Experiments 

We  show  experiments  on  the  following  graphs: 

•  The  EPINIONS  directed  graph:  A  graph  of  who-trusts-whom  from  epinions.com  [Richard¬ 
son  and  Domingos,  2(  ]:  N  —  75,  879;  E  =  508,  960. 

•  The  OREGON  directed  graph:  A  graph  of  connections  between  Internet  Autonomous  Sys¬ 
tems,  obtained  from  http  :  /  /topology  .  eecs  .  umich .  edu/ data .  html:  N  =  11, 461;  E 
32,  730. 

•  The  CITATIONS  directed  graph:  A  graph  of  paper  connections,  obtained  from  the  KDD  Cup  (2003) 
website:  http  :  / / www .  cs  .  Cornell .  edu/pro  jects/kddcup/ datasets  .  html: 

N  =  25,  059;  E  =  352,807. 

•  The  CLICKSTREAM  bipartite  graph:  A  graph  of  the  browsing  behavior  of  Internet  users  [  lont- 
gomery  and  Faloutsos,  2001].  An  edge  (u,p)  denotes  that  user  u  accessed  page  p.  It  has 
23,  396  users,  199,  308  pages  and  952,  580  edges. 

•  The  EPINIONS-U  undirected  graph:  This  is  an  undirected  version  of  the  EPINIONS  graph: 

N  =  75,  879;  E  =  811,602. 

The  questions  we  want  to  answer  are: 

•  (Ql)  How  well  does  R-MAT  match  directed  graphs,  such  EPINIONS,  OREGON  and  CITA¬ 
TIONS' ? 

•  (Q2)  How  well  does  R-MAT  match  bipartite  graphs,  such  as  CLICKSTREAM1 

•  (Q3)  How  well  does  R-MAT  match  undirected  graphs,  such  as  EPINIONS-U ? 

For  each  of  the  datasets,  we  fit  the  R-MAT  parameters,  and  compare  the  true  graph  with  one 
generated  by  R-MAT.  We  also  compare  R-MAT  with  three  existing  generators  chosen  for  their 
popularity  or  recency: 

•  The  AB  model  [Albert  and  Barabasi,  2000]:  This  is  a  preferential  attachment  model  with 
an  additional  process  to  rewire  edges  and  add  links  between  existing  nodes  (instead  of  only 
adding  links  to  new  nodes). 

•  The  Generalized  Linear  Preference  ( GLP)  model  [  i  and  Towsley,  2002] :  This  modifies  the 
original  preferential  attachment  equation  with  an  extra  parameter,  and  is  highly  regarded  in 
the  networking  community. 

•  The  PG  model  [Pennock  et  a  ,  ] :  This  model  has  a  parameter  to  traverse  the  continuum 

from  pure  preferential  attachment  to  pure  random  attachment. 

There  are  two  important  points  to  note: 
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Dataset 

Graph  type 

Dimensions 

Edges 

R- 

a 

MAT  p 

b 

arametc 

c 

03 

U 

EPINIONS 

Directed 

75,879x75,879 

508,960 

0.56 

0.19 

0.18 

0.07 

OREGON 

Directed 

11,461x11,461 

32, 730 

0.597 

0.165 

0.233 

0.005 

CITATIONS 

Directed 

25,059x25,059 

352,807 

0.538 

0.111 

0.248 

0.103 

CLICKSTREAM 

Bipartite 

23,396x199,308 

952,580 

0.50 

0.15 

0.19 

0.16 

EPINIONS-U 

Undirected 

75,879x75,879 

811,602 

0.55 

0.18 

0.18 

0.09 

Table  6.3:  R-MAT  parameters  for  the  datasets 


•  All  of  the  above  models  are  used  to  generate  undirected  graphs,  and  thus,  we  can  compare 
them  to  R-MAT  only  on  EPINIONS-U. 

•  We  were  unaware  of  any  method  to  fit  the  parameters  of  these  models,  so  we  fit  them  using 
a  brute-force  method.  We  use  AB+,  PG+  and  GLP+  for  the  original  algorithms  augmented 
by  our  parameter  fitting. 

The  graph  patterns  we  look  at  are: 

1.  Both  indegree  and  outdegree  distributions  (Definition  17). 

2.  “Hop-plot”  and  “effective  diameter”  (Definitions  19  and  20). 

3.  Singular  value  vs.  rank  plot  (also  known  as  the  scree  plot,  see  Definition  18). 

4.  “Singular  vector  value”  versus  rank  plots  (Definition  26). 

5.  “Stress”  distribution  (Definition  25). 


6.3.1  (Ql)  R-MAT  on  Directed  Graphs 

Figures  6.3,  6.4  and  6.5  show  results  on  the  EPINIONS,  OREGON  and  CITATIONS  directed 
graphs.  The  R-MAT  fit  is  very  good;  the  other  models  considered  are  not  applicable.  The  cor¬ 
responding  R-MAT  parameters  are  shown  in  Table  6.3. 


6.3.2  (Q2)  R-MAT  on  Bipartite  Graphs 

Figure  6.6  shows  results  on  the  CLICKSTREAM  bipartite  graph.  As  before,  the  R-MAT  fit  is 
very  good.  In  particular,  note  that  the  indegree  distribution  is  a  power  law  while  the  outdegree 
distribution  deviates  significantly  from  a  power  law;  R-MAT  matches  both  of  these  very  well. 
Again,  the  other  models  are  not  applicable. 
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Figure  6.3:  Results  on  the  EPINIONS  directed  graph:  The  AB+,  PG+  and  GLP+  methods  do  not 
apply.  The  crosses  and  dashed  lines  represent  the  R-MAT  generated  graphs,  while  the  pluses  and 
strong  lines  represent  the  real  graph. 
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Figure  6.4:  Results  on  the  OREGON  directed  graph:  The  AB+,  PG+  and  GLP+  methods  do  not 
apply.  The  crosses  and  dashed  lines  represent  the  R-MAT  generated  graphs,  while  the  pluses  and 
strong  lines  represent  the  real  graph. 


81 


Left  singular  vector  elements 


inp-Citations  In-degree 

R-MAT  In-degree 


V 


100 

In-degree 


(a) Indegree 


(c)  Scree  plot 


inp-Citations  Out-degree 

R-MAT  Out-degree 


1000  10000 


\ 


10  100 
Out-degree 


(b)  Outdegree 


(d)  Hop-plot 


(e)  First  left  singular  vector  (f)  First  right  singular  vector 


(g)  Stress 


Figure  6.5:  Results  on  the  CITATIONS  directed  graph:  The  AB+,  PG+  and  GLP+  methods  do 
not  apply.  The  crosses  and  dashed  lines  represent  the  R-MAT  generated  graphs,  while  the  pluses 
and  strong  lines  represent  the  real  graph. 
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Figure  6.6:  Results  on  the  CLICKSTREAM  bipartite  graph:  The  AB+,  PG+  and  GLP+  methods 
do  not  apply.  The  crosses  and  dashed  lines  represent  the  R-MAT  generated  graphs,  while  the 
pluses  and  strong  lines  represent  the  real  graph. 
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Figure  6.7:  Results  on  the  EPINIONS-U  undirected  graph:  We  show  (a)  degree,  (b)  hop-plot, 
(c)  singular  value,  (d)  “singular  vector  value,”  and  (e)  stress  distributions  for  the  EPINIONS- 
U  dataset.  R-MAT  gives  the  best  matches  to  the  EPINIONS-U  graph,  among  all  the  generators.  In 
fact,  for  the  stress  distribution,  the  R-MAT  and  EPINIONS-U  plots  are  almost  indistinguishable. 


6.3.3  (Q3)  R-MAT  on  Undirected  Graphs 

Figure  6.7  shows  the  comparison  plots  on  the  EPINIONS-U  undirected  graph.  R-MAT  gives  the 
closest  fits.  Also,  note  that  all  the  y-scales  are  logarithmic,  so  small  differences  in  the  plots  actually 
represent  significant  deviations. 


6.4  Details  of  Proofs 

Theorem  1  (Count- vs-degree).  For  a  pure  R-MAT  generated  graph  (ie.,  without  any  smoothing 
factors),  the  expected  number  of  nodes  Cf,  with  outdegree  k  is  given  by 

<*  =  (f)E  [p’Ai-riT*  (6.3) 

where  2n  is  the  number  of  nodes  in  the  R-MAT  graph  and  p  =  a  +  b. 

Proof  In  the  following  discussion,  we  neglect  the  elimination  of  duplicate  edges.  This  is  a  rea¬ 
sonable  assumption:  in  most  of  our  experiments,  we  found  that  the  number  of  duplicate  edges  is 
far  less  than  the  total  number  of  edges.  Each  edge  that  is  “dropped”  on  to  the  adjacency  matrix 
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takes  a  specific  path:  at  each  stage  of  the  recursion,  the  edge  chooses  either  Up  (corresponding  to 
partitions  a  or  b)  or  Down  (corresponding  to  partitions  c  or  d).  There  are  n  such  stages,  where  2" 
is  the  number  of  nodes.  Row  X  can  be  reached  only  if  the  edge  follows  a  unique  ordered  sequence 
of  Up  and  Down  choices.  Since  the  probability  of  choosing  Up  is  p  =  a  +  b  and  the  probability  of 
choosing  Down  is  1  —  p  =  c  +  d,  the  probability  of  the  edge  falling  to  row  X  is 


P(X) 


^num(Up)  ^  _  mum(Down) 
^num(Up)  ^  _  ^n-num(Up) 


(6.4) 


This  equation  means  that  any  other  row  which  requires  the  same  number  of  Up  choices  will 
have  the  same  probability  as  row  X.  The  number  of  such  rows  can  be  easily  seen  to  be  (num(upj)  • 
Thus,  we  can  think  of  different  classes  of  rows: 


Class 

Probability  of  getting  an  edge 

Num(rows) 

0 

pn 

R 

1 

pn~ 1  ( 1  —  p)1 

(?) 

i-1 

pn~i(i  —  py 

(?) 

n 

(l  -p)n 

(?) 

Now,  we  can  calculate  the  count-degree  plot.  Let 


NRk  =  number  of  rows  with  outdegree  k 

—  iVi?o,fc  +  NRi,k  +  •  •  •  +  NRH)k  (6.5) 

where  NR ^  =  number  of  rows  of  Class  i  with  outdegree  k.  Thus,  the  expected  number  of  rows 
with  outdegree  k  is: 

n 

E[NRk}=YJE[NRlX]  (6.6) 

i=0 

Now,  for  a  row  in  Class  i,  each  of  the  E  edges  can  either  drop  into  it  (with  probability  pn~l(l  —  //)') 
or  not  (with  probability  1  —  pn~l{l  —  //)').  Thus,  the  number  of  edges  falling  into  this  row  is  a 
binomially  distributed  random  variable:  Bin(E,pn~l(  1  —  p )*).  Thus,  the  probability  that  it  has 
exactly  k  edges  is  given  by 

pi,k  =  (f)  [p“-(i  -p)T  [i  -p”-f(i  ~pr]E~k 


Thus,  the  expected  number  of  such  rows  from  Class  i  is 

E[NRitk]  =  number  of  rows  in  Class  i  *  Pi)k 


*  Pi,k 

E 
k 


[pn-i(i-py]‘ 


*  [i-pn-\i-pf 


E-k 


(6.7) 
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Using  this  in  Equation  6.5  gives  us: 


E[NRk] 


k 


*  [i  - pn-\i  - Py]E  k 


(6.8) 


Equation  6.8  gives  us  the  count  of  nodes  with  outdegree  k;  thus  we  can  plot  the  count-vs-outdegree 


plot  using  this  equation. 


□ 


6.5  Summary 


We  presented  the  R-MAT  graph  generator,  a  simple  parsimonious  model  for  efficiently  generating 
realistic  graphs.  Instead  of  focusing  on  just  one  pattern  like  the  degree  distributions  (as  most 
current  generators  do),  R-MAT  attempts  to  match  several  patterns,  including  diameter  and  hop- 
plot,  the  scree  plot  of  singular  values  versus  rank,  “stress”  plots  and  others.  We  experimentally 
demonstrate  our  parameter-fitting  algorithm,  and  the  ability  of  R-MAT  to  successfully  mimic  the 
patterns  found  in  several  real-world  graphs.  This  opens  up  the  possibility  of  using  R-MAT  for 
many  applications:  generating  small  samples  “similar”  to  a  large  given  graph,  or  for  simulation 
studies  when  a  “true”  real-world  graph  may  not  exist  yet,  or  may  be  too  costly  to  obtain,  or  for  use 
in  graph  compression  algorithms. 
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Chapter  7 
Conclusions 


Graphs  are  ubiquitous;  they  show  up  in  fields  as  diverse  as  ecological  studies,  sociology,  computer 
networking  and  many  others.  In  fact,  any  information  relating  different  entities  (an  M  :  N  re¬ 
lationship  in  database  terminology)  can  be  thought  of  as  a  graph.  Mining  the  hidden  patterns  in 
graphs  helps  define  what  is  “normal”  for  real-world  graphs,  and  deviations  from  such  patterns  can 
imply  abnormal  graphs/subgraphs. 

There  is  a  dichotomy  in  graph  mining  applications:  we  can  answer  specific  queries  on  a  par¬ 
ticular  graph,  or  we  can  ask  questions  pertaining  to  real-world  graphs  in  general.  In  my  thesis,  I 
explored  issues  from  both  sides  of  this  dichotomy. 

•  Problems  for  specific  graphs: 

-  How  does  a  virus  propagate  over  the  given  graph?  When  does  a  viral  outbreak  die  out? 

-  Under  what  conditions  does  a  piece  of  information  survive  in  a  given  sensor  network 
with  failing  nodes  and  links? 

-  How  can  we  automatically  cluster  nodes  in  a  given  graph?  How  can  we  quickly  and 
automatically  estimate  the  number  of  clusters  in  the  data? 

•  Real-world  graphs  in  general: 

-  What  patterns  and  “laws”  hold  for  most  real-world  graphs?  How  can  we  generate 
synthetic  yet  “realistic”  graphs? 


Viral  propagation:  A  quantitative  and  analytic  understanding  of  the  propagation  behavior  of  viruses 
is  critically  important  for  many  problems,  including: 

•  the  design  of  new  anti-virus  measures  that  combat  the  virus  globally ,  that  is,  over  the  entire 
network, 

•  the  determination  of  “susceptibility”  of  a  computer  (or  social)  network  to  viral  infections, 

•  the  design  of  new  “virus-resistant”  networks,  and 
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•  the  “optimal”  immunization  strategies  to  best  prevent  the  spread  of  a  virus,  and  avoid  an 
epidemic. 

In  addition,  this  work  is  not  limited  to  viruses:  the  same  principles  also  apply  to  the  spread  of 
rumors  and  fashions,  to  viral  marketing,  to  information  dissemination  campaigns,  and  the  like. 

We  formulated  the  virus  propagation  problem  as  a  non-linear  dynamical  system  (called  NLDS), 
which  requires  is  linear  in  the  size  of  the  network,  and  is  tractable  using  analytic  techniques.  Our 
main  contributions,  apart  from  the  design  of  NLDS,  are: 

•  Accuracy  of  NLDS;  We  show  that  the  number  of  infected  nodes  in  the  graph  for  NLDS  is 
very  close  to  that  for  the  full  Markov  chain,  for  a  variety  of  synthetic  and  real-world  graphs. 
Thus,  our  approximation  is  very  accurate. 

•  The  Epidemic  Threshold:  We  derived  the  epidemic  threshold  below  which  a  viral  outbreak 
is  expected  to  die  out,  but  above  which  it  might  survive  for  long.  Surprisingly,  this  condition 
depends  only  on  one  property  of  the  graph:  its  largest  eigenvalue. 

Information  survival  threshold:  The  problem  of  information  survival  in  sensor  networks  is  similar, 
but  the  occasional  failure  of  nodes  in  the  network  adds  more  complexity.  Our  contributions  are: 

•  Problem  formulation  methodology:  We  again  demonstrated  the  formulation  of  the  problem 
as  a  non-linear  dynamical  system,  thus  proving  the  generality  of  this  approach. 

•  The  Information  Survival  Threshold:  Solving  the  non-linear  dynamical  system  allows  us  to 
find  the  information  survival  threshold  for  any  given  sensor  or  P2P  network.  This  is  found 
to  depend  on  the  largest  value  of  a  properly  constructed  matrix,  which  links  together  the  link 
qualities  and  failure  rates  in  the  sensor  network. 

As  sensor  networks  become  more  and  more  commonplace,  this  result  will  be  important  in  deter¬ 
mining  the  “informational  load”  we  can  place  on  the  network. 

Automatic  clustering  of  nodes: 

Clustering  the  nodes  in  a  graph  is  a  powerful  way  to  mine  the  data  in  a  graph.  It  tells  us  the 
major  “classes”  of  nodes,  and  their  behaviors  (which  other  “classes”  do  they  link  to  strongly).  The 
clusters  can  be  used  in  visualization  tools  to  quickly  provide  a  summary  of  a  large  graph  dataset. 
Clustering  has  applications  in  personalization  services,  consumer  segmentation,  fraud  detection, 
collaborative  filtering;  the  list  goes  on. 

We  developed  an  efficient  and  accurate  method  of  finding  clusters  in  large  graphs,  with  many 
useful  properties: 

•  Completely  automatic:  Our  clustering  method  is  one  of  the  few  that  needs  no  “magic” 
numbers.  It  automatically  figures  out  both  the  number  of  clusters  in  the  data,  and  their 
memberships . 

•  Scalable:  The  method  is  linear  in  the  number  of  edges  in  the  graph,  and  is  thus  scalable  to 
large  graphs. 
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•  Incremental:  New  data  can  easily  be  incorporated  into  the  current  clustering  solution,  thus 
avoiding  the  costs  of  full  recomputation. 

•  Applicability  to  both  self-  and  bipartite-graphs:  The  algorithm  can  be  used  for,  say,  both 
customer-product  data  as  well  as  social-network  data. 

•  Outlier  detection:  In  addition  to  clustering,  we  also  proposed  efficient  methods  to  detect 
outlier  edges  in  the  data,  and  rank  them  in  decreasing  order  of  “outliemess.” 

•  Inter-cluster  “distances:  ”  An  intuitive  metric  was  proposed  as  a  “distance”  between  clusters, 
and  an  efficient  algorithm  to  compute  it  was  developed. 


The  R-MAT  graph  generator: 

A  good  graph  generator  is  an  essential  part  of  a  graph  miner’s  toolbox.  The  generator  must  be 
able  to  match  the  common  graph  patterns  and  “laws,”  besides  being  efficient,  parsimonious,  and 
having  a  simple  parameter-fitting  algorithm.  The  uses  are  many: 

•  Simulation  studies:  Algorithms  can  be  tested  on  synthetic  yet  realistic  graphs,  where  obtain¬ 
ing  real-world  graphs  might  be  costly  or  even  impossible. 

•  Outlier  detection:  Deviations  from  the  common  patterns  can  signify  outliers,  which  need  to 
be  investigated  further. 

•  Realism  of  samples:  Samples  of  graphs  can  be  tested  for  the  graph  patterns  to  verify  their 
realism. 

•  Extrapolation  of  data:  The  generator  can  be  used  in  “what-if”  scenarios,  to  visualize  what 
current  graphs  would  look  like  when  they  grow  by  x%. 

Our  R-MAT  graph  generator  is  exactly  is  a  step  in  this  direction:  it  matches  many  of  the  graph 
patterns,  while  requiring  only  three  parameters;  the  generation  process  is  efficient  and  scalable; 
and,  the  same  basic  method  can  be  used  to  generate  weighted  or  unweighted,  directed  or  undi¬ 
rected,  self-  and  bipartite  graphs.  Indeed,  we  experimentally  demonstrated  the  ability  of  R-MAT 
to  fit  several  large  real-world  datasets  of  different  types. 

To  conclude,  we  developed  tools  for  mining  graph  datasets  under  a  variety  of  circumstances, 
from  the  propagation  characteristics  of  real-world  graphs,  to  the  clusters  hidden  within  them,  and 
finally  on  to  the  patterns  and  laws  that  govern  them,  and  a  generator  to  mimic  these.  Each  of  these 
is  an  important  tool  in  its  own  right;  combined  together,  their  usefulness  is  increased  even  more. 
“How  susceptible  will  the  Internet  be  to  viral  infections  if  its  grows  by  x%  nodes  and  y%  edges?” 
Our  work  can  shed  some  light. 
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Chapter  8 
Future  Work 


Graph  mining  is  still  a  very  young  discipline.  Even  though  parts  of  it  have  been  studied  in  several 
communities  for  a  while,  many  important  questions  are  unanswered  as  yet.  There  are  many  avenues 
for  future  work,  some  of  which  will  be  discussed  below. 


Mining  network  traffic  matrices.  Traffic  matrices  provide  information  about  the  flow  of  packets 
through  routers  over  time,  and  we  would  like  to  detect  outlier  traffic  (which  may  represent  Denial- 
of-Service  attacks).  While  nominally  similar  to  the  study  of  time  evolution  of  graphs,  there  are 
several  special  constraints  in  this  dataset. 

The  primary  constraint  is  speed  and  scalability:  a  router  under  heavy  load  cannot  spend  much 
time  to  process  each  packet  it  handles.  Any  mining  algorithm  must  operate  very  quickly  on  each 
packet.  However,  the  processing  power  of  a  router  is  also  constrained:  thus,  the  computations  per 
packet  must  be  quite  basic.  We  can  operate  on  only  a  sample  of  the  data,  but  the  realism  of  this 
sample  also  depends  on  the  semantics  of  the  domain:  for  example,  if  we  want  to  count  the  number 
of  open  connections,  the  sample  might  need  to  be  biased  towards  TCP  SYN  packets.  All  of  these 
issues  create  an  extremely  interesting  problem,  and  this  has  attracted  a  lot  of  recent  interest  due  to 
its  importance  in  industry. 


Clustering  on  weighted  graphs.  Our  Cross-associations  algorithm  operates  on  unweighted 
graphs  (or  equivalently,  binary  matrices).  How  can  this  be  extended  to  weighted  graphs? 

The  main  issue  is  that  defining  a  “homogeneous  block”  is  hard  in  the  weighted  context.  Again, 
perhaps  we  should  not  be  looking  for  homogeneous  blocks  at  all,  but  for  a  “realistic”  block  such 
as  one  generated  by,  say,  R-MAT.  This  changes  the  cost  function  that  we  optimize  using  the  MDL 
criterion,  and  the  heuristics  for  finding  a  good  clustering  might  also  need  modifications. 
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